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Numerical Techniques 
Henry P. Decell, Jr. 

General University of Houston 

In the digital calculations that drive the classification portion of the 
Hybrid Pattern Recognition System (HYMPS) there are three items that warrant 
"tuning". They are: 

I. Matrix Inversion 

II. Det calculations & Singularity 

III. Covariance Factorization 

Although I, II, III are essentially viewed separately in the HYMPS writeup, 
they are, in fact, related and some improvement in numerical accuracy and 
computational speed can be gained by simply deleting redundant matrix manulipations. 


Introduction 

In what follows we will show that it is more economical to first factor 
the covariance matrix and, by so doing, delete the matrix inversion MINV. 
Necessary Det calculations can, moreover, be more easily realized by use of 
simple theoretical facts about the factorization. 

Basically the reasons for doing the factorization first are: 

1. MINV (or any other inversion routine, for that matter) can be 
eliminated in the current calculations 

2. When MINV is deleted, errors in computation will be directly 
related to the factor routine and not to a combination of inversion 
factorization (unknown) errors. 
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All required information for classification is contained in the 
factorization. 

4. The upper triangular form required in the analog classification 
scheme is preserved in these operations. 

5. The B, matrix calculations in their present form are no longer 

k 

required. 

Factorization 

We recommend that the covariance matrix be facftored into upper 
triangular form" i.e. 



where 


A is a matrix with all zeros below the main diagonal (this is now being done 
to t 1 in HYMPS, after applying MINV to TL ) . The results of this 


-1 


factorization will be as good as those obtained in factoring since we 

propose that the same factorization routine be utilized. In fact, the conditioning 
°f would produce factorization error since ^ may well be garbage. 


Now if = AA then 


1 = (A 1 ) T A~ 1 and 


.-1 


since A is upper triangular so is A 

We wish to compute the value of the classifier 


(2*rli! 


\ rr exp - i(X-X) 1 (X-X) . 

3 i5i n> 2 


f(x) = 




3 


If we let Y = A -1 (X-X) then it is easy to see that the exponent Q is 


Q = -j {A" 1 (X-X)} T {A _ 1 (X-X)} 


1 T 

y l y = 


-it, 

1=1 


Hence if Y = A -1 X then AY = X and since A is upper triangular we 
can write the recursion formula for the as follows. We do it in general, 

however, for liYMPS M=6 


AY = X - X 

mm m mm 


A . ,Y , + A , Y « X . 
m-lm-1 m-1 m-lm m m-1 


- X 


m-1 


A m-2m-2 Y m-2 + A m-2m-l Y m-l + A m-2m Y m X m-2 X m-2 


A 11 Y 1 + A 12 Y 2 + 


+ A 1 ¥ 1 + A. Y = X. - X.. 

lm-1 m-1 1mm 1 1 


In another form 


Y = X - X , 
m m m/ 


mm 


m-1 


X . - X . 

m-1 m-1 

A i i 
m-lm-1 


m-lm 


A m-lm-l A m-lm-l 


{X . — X . - A - Y } 
m-1 m-1 m-lm m 


Y m-2 A „ ^ X m-2 X m-2 A m-2m Y m A m-2m-l Y m-l^ 


m- 2m- 2 


¥ 1-5^ {X 1- X l- (A 12 Y 2 +> * ,+ A lmV ) 


I 



In general, 


k 


m-k A 


m-km-k 


(X . - X , ) — 
m-k m-k 


j=l A m-k,m-(j-l) Y m-(j-l) 


Det Calculations 

Since 4« = AA it follows that: 

det^ = det(AA T ) = (det A) (det A T ) 

= (det A) (det A) 

= (det A) ^ 

Since A is upper triangular, its eigenvalues are the diagonal elements of A 

Moreover, the det of any matrix is the product of its eigenvalues so that 

m 

det A = A. . 

i=l 11 


Hence 


det it. = (det A) = 


N 


• 4 

m 

det^ = Tjj 


A. . , 

l-l li 


an easy by-product of the factorization independent of MINV. 

Singularity Evaluation 

In the divergence calculations one should avoid concluding that is 

"near singular" if det i = 0. 



This is a classical misunderstanding of the theorem which states: 


II 



is singular if and only if 


det 2 .= 0" 


The misunderstanding arises by assuming a similar (however meaningless) theorem, 
namely, 


II 



is near singular if and only if 


detX= 0" 


The fact of the matter is that there does not exist a concept of "near 
singular" in matrix theory. The term "near singular" applies to numerical 
difficulties one may encounter in inverting matrices and is in no way related 
to whether or not 2L is in fact singular. 


Consider the example (3 x 3) 



det A = (l(f 6 ) 3 = 10“ 18 = 0 


Yet A is neither singular nor numerically difficult to invert. 

Note: The fact that the example is "upper triangular is of no particular 

consequence except that det A is easy to calculate by inspection. 

In fact for this A 

2^ = AA is symmetric and positive definite (See page 6-7) 
yet aet2> = 10 38 — 0. 
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Abstract 

This paper briefly reviews and improves upon classical iterative 
methods in nonlinear regression. This is accomplished by discussion of 
the geometrical and theoretical motivation for introducing modifications 
using generalized matrix inversion, other than but in the same general 
vein as those discussed by Fletcher [6], Examples having inherent pitfalls 
described in [8], [12] and others are presented and compared in terms of 
results obtained using classical and modified techniques. The modification 
is shown to be useful alone or in conjunction with other modifications 
appearing in the literature. 



Introduction 


Following for convenience the notation of [8], let y denote a 
set of n responces of the form 


y t = f t (0) + e t ’ t = l,...,n 


where the response function f (0) is a known function of t and an 

A 

undetermined vector 0 = (0,,...,0 ). We will call the vector 0 a least- 

1* p 

A 

squares estimate (given the n responses) of 0 provided 0 minimizes 


Q(0) = l (y t - f t (0)> . 

t=i 

The vectors are defined 


and the matrices 


0-/ fll - 3(Q(9>> 

Q (0) IQ— 


i 

R(9) = (y t - f t (0) > 


f'(0) 


= ( 


3 0 


3 (f t (0) ) 


9Q(6) n 
30 . ; 

__J 

30. 


i 


T 

) 


Q”(0) » 
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Three of the most common differential correction schemes for 

A 

estimating the parameter vector 8 are the steepest descent method, the 
quadratic approximation, and the Gauss-Newton method, with corrections 
respectively given by 

A0 = -aQ' (8) , a > 0 

A0 = — (Q ’ ’ (0)) -1 Q’ (8) 

A0 = -l/2(f , (8) T f’(6))~ 1 Q , (0) . 

These methods have their advantages and disadvantages. Of the 
three, the Gauss-Newton method is probably most popular. 

The authors of [8] present a modification of a classical method and 
state that "The step A0 will in general be distinct in both length and 
direction for each of the three methods." This is not necessarily the case 
from a computational point of view since the matrices to be inverted may be, 
for all practical computational purposes, singular; yet the system of 
equations may have infinitely many solutions. For example, the Gauss- 
Newton correction requires the solution of the equation 

f ' (0) T f ' (8)A0 = f ' (0) T R(0) 

-1/2Q* (0) = f’(e) T R(0) . 


since- 



It is known that any equation of this form (i.e., of the form 
T T 

A Ax - A z, the normal equations of the least-squares problem: minimize 

T 

(Ax-z) (Ax-z) given A and z) always has at least one solution and 
perhaps infinitely many. We will try to point out the significance and 
consequences of these solutions in terms of their relationship to 
differential correction schemes. 

The Generalized Inverse 

A few basic concepts regarding generalized inverses important to 
the development follow. 

Theorem 1 . The four equations AXA = A, XAX = X, (AX)* = AX, and 
(XA)* = XA have a unique solution X for each complex m><n matrix A. 
This solution X is called the generalized inverse of A and is denoted 
by X = A + . 

This theorem is due to Penrose [10] and is equivalent to the 
apparently more geometric characterization of the generalized inverse 
of A which follows. 

Theorem 2 . The generalized inverse A + of A is the unique solution 


of the equations 



where and > respectively, denote the perpendicular 

projection operators on the range spaces (column spaces) of A and X. 

In any case, it is easy to see that if A is square and non- 
singular, then A + is the ordinary inverse of A. Much work has been 
done recently in the area of generalized matrix inversion, including 
theoretical developments and computational techniques, rendering it a 
very useful tool in matrix theory and applications. A rather exhaustive 
bibliography concerning applications of generalized inverses can be found 
in [2], [3], and [13]. We will not develop the details of the basic 
concepts, but rather state an important theorem regarding the solution of 
matrix equations in general. 

Theorem 3 . The matrix equation AXB = C has a solution X if and only 
if AA C3 B = C, in which case all solutions are given by 

X = A + CB + + S - A”ASBB + 

where S is an arbitrary matrix having the dimensions of X. 

T T 

The Equation A Ax = A z 

As stated earlier, the Gauss-Newton method involves the solution 

of an equation of this type at each iteration. The following corollary 

to Theorem 3 will give some insight to a possible course of action one 

could take at those times during the iteration process when the matrix 
T 

f'(6) f'(G) (or perhaps even a matrix such as Q' ’ (0) in another method 




requiring inversion for the calculation of the -orrection £6) is 

actually or nearly singular. For the purpose of this paper, we will 

describe how generalized inversion can be useful in iterative techniques 

T T 

requiring the solution or equations of the fora A Ax = A z. 

Corollary 1 . If A is any m x n matrix and z is any m x l vector, then 
T T 

the equation A Ax = A z has at least one solution and all solutions are 
given by 

X = A + z + (I - A + A)y 

where y is arbitrary having the dimensions of x. 

The proof of Corollary 1 is an immediate consequence of Theorem 3 
and fact that (A T A) + A T = A + [10] . 

T T “1“ 

Corollary 2 . Among the solutions of A Ax = A z, the solution x = A z 
has the smallest Euclidean norm (henceforth "norm" will be denoted | | • | | ) . 

The proof of Corollary 2 follows from the facts that I - A + A is 
the orthogonal projection operator on the orthogonal compliment of the 
range space of A + and hence that A + z and (I - A + A)y are orthogonal 
for every y. In fact, 

i |A + z + (I - A‘A)y| j 2 = | |A + z| | 2 + j | (I - A + A)yj | 2 


> 


2 
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The significance of Corollary 1 is that there may be infinitely 
many possible corrections A0 satisfying an equation defining a 
differential correction scheme in the presence of a singular or, in the 
computational sense, nearly singular coefficient matrix. There is a 
tendency to disregard or remain unaware of these solutions and, with the 
inability to invert the coefficient matrix, - to look for new or modified 
techniques such as those found in [1], [5], [8], [9], and [12]. For 
example, in [7] Jennrich and Sampson modify the coefficient matrix by 
selected rows and columns. In [8], Marquardt changes the diagonal of the 
coefficient matrix. It has been our experience that these solutions should 
be given careful attention in the case of what will hereafter be called an 
apparent (i.e., actual or computational) singularity. 

Fletcher [6] points out that in the generalized least-squares 
(Gauss-Newton) or Newton methods "... A most important property of the 
generalized inverse formulation is that in all circumstances (i.e., full 
rank or not), even when the generalized least-squares method would fail, 
the directions of search generated are downhill and so an imporvement can 
always be made to the sum of squares (assuming the approximation is not 
already a stationary point)." In this connection, the significance of 
Corollary 2 is that there is a reasonable way to choose a correction A0 
satisfying the defining equations of the scheme whenever an apparent 
singularity occurs. We propose to choose the minimum Euclidean norm 
correction A + z (i.e., the correction of shortest length consistent with 
the correction equation) . It has been our experience that in nonlinear 



equations other solutions can result in failure of convergence. 


The suggested correction certainly depends upon the algorithm used 

4 . 

to calculate A' and the actual computational way in which the algorithm 

T 

establishes that A is not of full rank (i.e., A A singular). Of course, 
this is intimately connected with near-zero tests in the algorithm, 
sensitivity to dependent columns or rows, conditioning, and so forth. We 
should further point out that, for a general differential correction scheme 
of the form M(0)A0 = z(0), the choice of the correction should be 
A0 = M(0) + z(0) if there is at least one solution for A0. Of course, 
according to Theorem 3 there will be at least one and possibly infinitely 
many solutions A0 if and only if M(0)M(0) + z (0) = z(0). Moreover, if 
there is one and only one solution, then that solution is indeed given 
by A0 = M(0) + z(0) . 

T 

For example, in the Gauss-Newton method, M(0) = f'(0) f'(6) and 

z(6) = f t (0) T R(0) so that A3 = M(0) + z(0) = (f * (0) T f ' (0)) + f ' (6) T R(0) = 

f'(0) + R(0). Even if M(0) is nonsingular, then (f 1 (0) T f ' (0) ) + = 

T -1 

f ' (6) f'(0)) , and either form of A0 may be used in calculations: 

A0 = (f ' (0) T f ' (0)) -1 f ’ (0) T R(0) = f’(0) + R(0) . 

In other words if M(0) is square and computationally nonsingular, the 
classical correction is, in fact, the minimum norm correction. We will 
not discuss the comparative aspects of computing A0 in a correction 
scheme such as the Gauss-Newton method by one or the other of the 



theoretically equivalent formulas: 


(1) a© = (f * (0) T f ' (0)) + f ' ( e ) t r ( 0 ) 

(2) A0 = f*(0) + R(0) 

Calculations in our examples use (2). 

We have had unusual success with this technique in many practical 
problems too numerous to mention here. In many cases, one definite 
advantage seems to be the ability to continue making corrections of 
reasonable length and'perhaps, as in the Gauss-Newton case, reasonable 
direction through regions in which the coefficient matrix M(Q) behaves 
baaly. We do not propose this technique as a cure-all but rather that it 
should be included among other useful techniques in nonlinear regression. 

A few examples having known pitfalls will be presented in the next section. 

Examples . 

In the following examples, the residual sum of squares Q (©) will 
be presented in tables by iteration number. The values of Q(0) for the 
methods cited will be those values tabulated in the references cited. Some 
authors divide Q (0) by the degrees of freedom. For clarity and easy 
comparison we indicate this division in the tables when necessary. Finally, 
the residual sum of squares given by the method of this paper (minimum norm 
correction) will be noted MN; Q(0). 
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Results of the method of this paper compared with those of the 
Modified Davidon Method (MDM) used in [12] to find the parameters of an 
exponential model discussed by Hartley in [7] are given in Table 1. 


Table 1 

Exponential Model (Hartley) 





Iteration 

MN; Q ( 6) ] 

MDM; Q (6) 

| 0 

27376 

27376 ! 

i 

; i 

! 

14586 

20127 ; 

2 

13779 

15412 

3 

13408 

| 

13552 ; 

I 

i 4 

133S4 

13485 

5 

! 

13390 

! 

. 13449 j 

6 


1 

13425 ■ 

7 

• 

13394 

8 


i 

13393 j 

9 


13390 
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A second exponential model given by the authors of [8] points out 
a failure of Hartley's method [7] due to a singular partial derivative 
matrix. In [8] a stepwise regression scheme (SR) is successfully utilized 
for this example. The results of the (SR) scheme compared with those of 
the method of this paper are given in Table 2. 

Table 2 

Exponential Model - Singular Partials 


I ' 1 

Iteration 

i 

MN; Q(0)/8 

SR; Q ( 0) / 8 

! ' 0 

1 

521.41 

521.41 

j 1 

429.84 

429.84 

2 

39.11 

88.15 

i 

3 

15.765 

83.74 

j 

1 4 

15.545 

* 

10 


21.33 

30 



15.545 


*The value of SR: Q(8) was not tabulated in [8] 

for this iteration. 

Another six-parameter exponential model having inherent singularity 
problems is presented in [12] using the Modified Davidon Method (MDM) . 

A comparison of the results using the technique of this paper is given 






in Table 3. 


Table 3 

Six Parameter Exponential Model - Singular Partials 


Iteration 

MN; Q(8) 

MDM; Q(0) 

0 

21.38 

21.38 

10 

.873 

2.39 

20 

.792 

1.99 

30 

.396 

1.77 

40 


1.59 

50 


j 

1.41 

60 


.90 

70 


.41 

80 


.407 


Concluding Remarks 

We have taken the liberty to exclude a reproduction of the detailed 
description of our example models. These models are thoroughly treated in 
[7], [8] and [12]. The tables give some indication of rates of convergence 
and a comparison of residuals only. We do not wish to leave the impression 



that iteration counts are comparable. For example, one Gauss-Newton 
iteration could have been equivalent to p conjugate direction steps 
for the matrix inversion employing the Davidon method. 
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Problem Statement : Let /T^, ^ * • • • , be n distinct, normally distributed 

classes or populations of two dimensional response vectors x a ( ^ where 


A i 


; is a measurement of the relative response of x along channel i. 

4 


cu 





channel i ^ 

The problem is to determine the "best channel" in the sense of divergence and 

in the sense of minimizing the probability of m i sc lass? f ication 0 

denote the sample covariance matrix for the i th class and suppose, 

after training, we find that * I, i = 1,2, „ . . , n. Let/i f be the 

mean associated with the i th population; then it is easily verified that 

the interclass divergence is: 

0(f » j) - (U - Li.) 2 > 0 

J 

The density function for the I th class Is: 


P;(*) 


I 


V27 r 


r -Ai)‘ 


It is useful at this time to consider the partitions of a given channel axis 
determined by the maximum likelihood solution of the Bayes discriminate problem. 
Reca II in this case x i s assigned to if if: 

In P k (x) a max f In P. (x), i = 1,2,,. . „,n f 
Under our assumptions that 2 . - |, it is easily verified this becomes x ?s 


assigned to ''iT ^ if 1 


( x " , u k ) “ min [ (x - ^ . ) , i = 1,2, 

, n - I. 


n J 


We shall assume U. ^ U for i » I, 

1 i + l 

Now note 


O • • 


( * ■/' ki )S = ‘ “D * 

- (x -u t f * 20*, - « U| )(X -/<.) ♦ (u t -Cl 1+| )‘ 



2 2 

so that (x - /./ ) > (x - /<■ ) whenever 2{u - u ) (x - u. ) ♦ ( n m - //. 

' 1 * i i i 

that is, whenever x - ^ ; + j)° 

Thus associated with each mean U , we have a region R such that if x £ R „ x 

i I i 

Ts assigned to the 71 . population where the R.*s are defined as follows: 

R , • f x* X ^ ±(u J + u 2 ) } 

. ® ^ x: u *- u { ) £ x £ 5 -( a j + a . + j )j[ i -2, • • • , n-I 

R n “ l xt ^ n -l + “ n >' x i 
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ii - / " * n 

Now consider the n—class problem with equal a priori probabi I i ties *■ — 

cost of m i sc lessi f y I ng an individual from population 71 . as being from population 

^ |j) ** l» anc i the probability that x belong to R^ given that the individua 

is from 71 P(j|i,R) =- I P. (x)dx , then the cost of misclassif ication to be 
' J R . 1 

expected totally is: J 


1 L 

n i=*J 





n ( n 

Q(R) = i?j q l ( j'j C 0|OP(j|i,R) 

& ' r 00 

L ! , *--i? i. 


j/! R j 


P. dx 


For i - I, /_ \ P.dx « \ P. dx = , , 

j-2 Rj ' Jn 1 '/iTJ/.. x 

J UR. 2vW, + i^) 

j-2 J ' 1 ^ 


" 00 




- b 2 

e dy 


'"CO 


k 


2* Jo 


- Tt . 
e dy 


'■* ( A>-V 


7 2F0 


-b 2 d 

e dy 


f^/D(2,l) 




L.2 


— gy ^ _ -~—yj 

e dy since by symmetry of e 2y 


L.2 


r °° 


we have ^ }_ 


r oo= 


00 


e- ^ dy - 2 -k 


0 


i 2 

e ^ dy =* I. 



{2U pa ge 3 


When i 


When I 


n we again use the symmetry of the function so that 

n ”‘ ' ( f ^/nVn-l 5 


. , P. dx * 

j =| R j 1 


jn-1 P. dx * — L i 

jU, Rj * / 2fJ- 


»“ 2(* “ a n) 


dx 


00 


^/n-l^n^ 


[2 Vj _ 


00 




dy 


"00 


*2 VMu-U ,) 

’ n n- 1 ' 


e' » dy 


I ^ 5-\fcT(n,n-I ) 2 

sT 1/2/7 ) e ^ dy 

■' 0 


< i < n 


i-I 


• ) R P t dX “ 

J=l R j 1 

j/' 



J=l R j 


P. dx + 



P. dx 

j-i + j'Rj 1 


/i-l P. dx + ^ 

Ur' n 

i=« j . U R. 

J 1 J j=i+i J 


Pj dx 


jf 2 (> j ) _ ) C - £(x-,W ) 

72 if! e dx +- J e 1 

l -® = ( V?W 


dx 


1 

m M fl 


^ j./ 

2 ( 'J 


'"i-fA ^ 


'-00 


00 

e" 57 dy + J e” 27 dy 


7** 


00 


2 ( “ ( j “ lf j _ | ) 


i 2 

- 2Y 

e dy + 


00 


s(;- U i -*.) 


-iy 2 

e dy 



I - 


I r^/od.I-lj _ . 2 C^Id( i+l,i) i2 


»/2tr 


J 0 


e ^ dy • 


- 5Y 


>0 


dy 


Finally we have the total cost of mlsclassif ication is 


9<R> - - 


k - 


fejf > 0 


^0(2,1 ) _ ,2 


n-I 


e“ ^ dy + J_(l - 


f 2^0|i"l ) _ Xy2 


i=2 


e ^ dy 


0 


^^ lD(i+ ' # l) e^ dy ) ♦ ± - J 


( k'\ D(n,n-lj - ^y 2 I 
e dy j 

[2U JO 


(n - n -0 


L 

i “2 J o 


2-/o(i ,»-I) 


n-I / 2 


e dy + / 

i=i J o 


^•Vd (!'♦!,!) 


- kr 

e dy 


So Q(R) =* n 


2 n-I /' i/oO + l.l ) ( 2 1 

, (n ‘ £j' 0 e 57 dy 


(I) 


Thus note that Q(r), the total cost of m isc lassi f i cat i on ? does not depend 

on D(i»j) for j ^ l-l,i+l# But recall the definition (or perhaps criteria) of total 

intercless divergence, namely, 
n n 

0 . 1- L 0(1, J) 

1=1 j-; 

j/t 


(2) 


I believe equations (I) and (2) express the main problem with the existing 

feature selection — classification scheme, namely that the feature selection 

criteria (2) is inconsistent with the classification criteria (|) 0 This paper 

has shown that when Z.. - I for all i, and u < ■ /( , a "better" feature 

1 i l+l 

selection criteria would be to desire 0 large where in this case D £ J>(n - I) and 


1 ^ ( ^ f D(UI,T) _ ^ 


°-w L 


i»i o 


dy 



~ I Z' 3 . iy2 | f 00 1.2 

The "nice" property about D is that _L e dy » .1+987 and — e 57 

'Jzt o l^ J o 

so that there is no need to worry about D(!,j) becoming too large. 

Finally, we consider a numerical example with all covariances equal to I and n « 3 
Assume the means along the channel I axis are given by 


dy 


/ li \ * - 6, 


, U 2 “ °* / l = 6 


0(l*2) + d( 1,3) ♦ D(2,3) “ (“6) 2 + (-12) + (6) ■ 216, and 


and the means along the channel 2 axis are given by 

it \ * 0, ,U 2 « I, “ 12 . 

i 

Then D 

°|channe| 2 = ^ + ^“ ,2 ) + (-II) 2 » 266. Thus the divergence criteria would imply 

selection of channel 2, 

But, the total probability of mi sclassi f ica tion is given by 

Q(R) 


channel I 3 L /2t< ^0 




{i* 'o 


^ Tl 0 /2; r 0 


dy 


(I - .1+987 - .1*987) 


dy 


.0017 


and 


Q(R) 


channel 2 . ■ - 

3 


2 - 


-if* 




&0 


dy - 


I/2W0 


- 5y 2 

e ^ dy 




- (I - .1915 - .5000) 
3 

.2056 


Since the probability of mi sclassi f ication is much less by this criteria the 
choice would be channel l 0 A pictorial representation is given below. 
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DIVERGENCE CONSIDERATIONS II 


by 

J. A. Quirein 

The interclass divergence D(i,j) in a sense, a measure of the 
"separability" of two classes IL and IL . The problem of determining a 
function F of the interclass divergence over all possible combinations of 
a fixed number of channels such that maximizing F will minimize the 
probability of misclassif ication (for that number of channels) has not yet 
been solved. 

Consider the case of three distinct classes 11^, II^, II^. One such 
function of the divergence typically constructed is of the form: 

F = D (1 ,2) + D(1 ,3) + D(2 ,3) . 

It has been previously shown that maximizing F need not necessarily mini- 
mize the probability of misclassif ication. A second commonly constructed 
function of the divergence is the following 

F = min (D (1,2) , D(l,3), D(2,3)). 

To show how maximizing F does not necessarily minimize the probability of 
misclassif ication, let the means along the channel 1 axis be given by 

y l = °» ^2 = 2 ’ 2, y 3 = 5,2 



and the means along the channel 2 axis be given by 


b 1 = o, e 2 = 2, e 3 = s 


then 


F 


channel 


1 


min (4.84, 27.04, 9) 


4.84 


1 channel 2 


= min (4, 64, 36) = 4 


and maximizing F implies selecting channel 1. The probability of mis 
classification is verified to be 


Q(R) I channel 1 " * 135 
Q ^ R) I channel 2 = * 107 


which indicates in this case, the "best" choice would be channel 2. 
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PATTERN RECOGNITION AND THE POTENTIAL FUNCTION 


Supposing that we have two sets A and B which do not intersect in a space^M-^ 4 *^ 
, then there exists at least one separation function lj- (x) for which If' (X) /* 0 
if X 6 A and 2/'(x) <0 If X 6 B. The idea of the potential function is to 
build, by an iterative process with a finite number of known points from A and B, 
a sequence of functions K r (x) which tend to one of these separation functions 
as r increases* 

Assume that in there is a linearly independent system of functions, 

l0 (x), a subset of a complete system, such that for any two separable (always 
J ' N 

taken hereafter to mean in the geometric sense) sets in Jc , l/.' (x) = Z- c i jr .(X) 

i=>l 1 

separates these two sets, N depending on the sets to be separated* In order to 

have convergence in probability let the ^.(x)*s be an orthogonal or orthonormal 

system* Additionally if K(X,Y), the potential function, is bounded on AUB and the 

function (X) rigorously separates A and B(i*e. \\ >< 6 B where o) , 

it can be proved that there is an integer m, independent of the teaching sequence 

so that the number of errors corrected does not exceed m 0 If the appearances of 

the points in the teaching sequence are independent events and at any rth step 

there is a strictly positive probability of correcting an error if separation 

of the sets h«$ not yet occurred, then the probability is unity that the separation 

of the sets will be realized in a finite number of steps. If we agree to terminate 

the teaching process as soon as no error has occurred in L examples in the sequence 

following an error correction (L, an arbitrary prescribed integer) then the entire 

teaching sequence will be terminated in Lm steps. Let P be the probability of 

error in the process after termination of teaching and 6>0, then it can 

. In ( ^/m) 

be proved that the probability that P<6 exceeds I -5> if L satisfies L> * 

In (I -6) 



ALGORITHM 


The construction of a separation function If- (x) shall be accomplished as 
fol lows: 

Let the potential function be defined by: 

N 

K(X,Y) = L A 2 (X) &' (Y) 
i=! * ! 1 I 

and let A be the positive set and B, the negative one. 

For Kj (x) we will take: 

( K(x,x.) if X t A 

K.(X) 3 ) 1 1 

i - K(X,Xj ) if X ( t B 

Inductively we proceed after the rth step, inwwhich the function K (x) was 

— r 

constructed. Compute K (x , )• If either K (x^ , ) > 0 and X 6 A or K (x )< 0 

r r+I r r *i r +| r r + 1 

and X B (i.e, the function K (x) agrees at the point X with our oriqinal 

r+ I r r+| 

convention of A, positive and B, negative), we shall set K ^ (x) * K (x) and proceed 
to the next point X^. If K r (X r+| ) > 0 and X r+| 6 B, set K.^(x) - K (X) - K(x,X r+| ). 
If K (x ) < 0 and X 6 A, set K ,(x) » K (X) + K(X,X , ). In either of the 
latter two cases the potential function is altered by addition to i t of the potential 
of the (r + I )st point with sign necessary to "correct" the function at this step. 


EXAMPLE 


For our space we choose X [-l,lj • Let A ® [ (x,y): - x,y £ ^ ^ 

and B »• ? (x,y): |-£ x,y£ ~ ( and, using the training points given in figure I, 

^ 0 M f C\ ? w C- A 


bu 


> 0 if Xfc A 


ild a separation function If (X) = £ c. 0 if x < B ' Since 1 and 

x + y are linearly independent and defined on using the Gram-Schmidt process 

we find for d,(x) 3 I and d (x) » x + y - I we have an orthogonal set of functions 

> d C I C i 

where the inner product is defined by ( f . (x), <^j(X)) * ) , j , f'; M • (*) dx d Y* 

Letting * I, K(x,X k ) * I + (x k + y k - !)(x + y - I) where X = (x,y) and X k * (x ,y ) 

* k 
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X -(10/16,10/16) 

x 9 «( 2/16, 2/16) 

X |7 -( 7/16, 7/16) 

x 2 “( b/\6, U/\S) 

x, 0 * ( 9/16, 8/l 6) 

x t8 -( 7/16,10/16) 

X 3 -( 8/16, 7/16) 

X, »< Vi 6, 2/16) 

x |9 ,«( 5/16, Vl6) 

X u -( 1/16, 5/16) 

X |2 -( 6/16. 8/16) 

x 20 = ( 7 / ,6 » 6/16) 

X 5 -( 1/16, 2/16) 

X |3 =»( Vl6, l/l6) 

X 2| V 5/16, 5/16) 

X 6 -( 8/16, 9/16) 

X |^= ( 3/16, 5/l6) 

x 22 “( 7/16, 8/16) 

X Y = ( 8/l6, 6/ 16) 

x — ■( 10 / 16 , 9 / 16 ) 

15 

x 23 =( 10 / 16 , 8/16) 

X 8 -( 2/16, 3/16) 

x |6 -( 3/16, 3/16) 

x^-( 3/16, 1/16) 


Figure I 
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Construction of K (X) will therefore always yield a line, moreover, a line whose 
r 

slope is negative one. Since A and B can be separated by such a line choice of 

N «* 2 wi I I yield a separating function as desired. 

By definition Kj(x) = - K(X,X,) - -I - J (x + y - I) since Xj 6 B. Figure 2 

shows K (X) in relation to A and B. Testing Xg in K (x) we find Kj(Xg) * - — < 0. 

Since X 6 A put K (X) » K (X) + K(X,Xg) - - f (x * Y “ 0* ,n f '9 ure 3 we see that 
2 d | 4 j 

lies below the line K 2 (x) * 0 and testing we find that K^X^) “ ^ ^ 0 and 
since X^6 B, K^X) ~ Kg(x) - K(X,X^) * -I - * Y ~ 0* Since K ? (X^) < 0 

and X,U (see figure h), K, (X) * K (x) + K(X,X^) * - ^ ( x + y - « )• Since 
K u (X 5 )> o and X ? e A, K 5 (X) * K^(X) - - ~ (x + y - l)o 
Continuing the process we finds 


k 6 (x) =■ k 5 (x) - K^(X) - - ^(x ♦ y - 0 
K ? (x) - (x + y- l) 


K fi (x) ° K 0 (x) ” K (X) * K (x) - - — (x ♦ y - I) 

° 9 toll 8 

K.o(x) => K (X) - -I - 7 (x * y - I) 

1 15 b o 

K |i+ (X) - K |5 (X) - K |6 (X) *^(xM-D 

k |? (x) * k |8 (x) * -I - j(x + y - i) 

K I9 (X) “ - I6 (x * y ’ ,} 

k 2o ( x) *«i- ! |(x + y- i) 

k 2,(x) - - y (x ♦ y - I) 

K 22 (x) 51 K 23 (X) “ K 2l + (X) = - 1 - ^(x + y- O 
Figure 5 shows the relationship of K. (X), i » i+,. • », 2it ; to the sets A and B. 

Taking f(X) = K^(x) - -I - ^ (x ♦ y - I ^ (x) - L c ] ^.(x) for 

c, a - | and c » - — , Testing the function, it does, indeed, separate the 

1 2 16 

training sample, for Ip (x) > 0 for all Xfe A and Ip (X) £ 0 for a 1 1 X 6 B* A 
geometric analysis of the sets shows any function of the form -I ♦ q(x + y - I) 


will separate i f 


5 q + | ^ U3 

— C M L < , and our q = - T 7 satisfies this inequali ty. 
8 q it 16 
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Although the training points of this example were purposefully "rigged” to 
insure that each part of the definition of K^(X) would be used and that convergence 
would be accomplished in the limited number of training points, in the latter 
case, if the training sample had run out without clear separation it could have 
been reused in the continued construction of the function. Certain points appear 
more critical to the process; in our example, those points in A nearest the 
shaded region in figure 6 are more sensitive to change and those points in B 
nearest to the line x * y - I => 0 and to its left produce more change as the 
algorithm progresses# However^ these remarks are pertinent to this example alone 
as alteration by so simple a change as choice of A would require completely different 

though analogous, comments. 

It was necessary to avoid any point X . for which K (x ) « 0 since 

' r+l r' r+l' 

the algorithm does not deal with this possibility (i.e. take X on the line 
x + y - I * 0 at alternating steps of the function construction beginning at r * 2), 

It would seem advisable to add to the algorithm "if K r (X r ^|) “ 0, let X r+ ^ become 
the (r + l)st point, discarding the original X r+I as a training point and 

ii 

renumbering the points. 

EVALUATION 

In [lOj, the purely geometric method of the potential function is compared 
with a structural approach, basically one of recognition of broad interclass 
sim i lari ti es, and it is the opinion stated in this paper that neither method is 
suitable to solve complex problems. In the case of the recognition of the letters 
of the alphabet photoelectric cells 1000x1000 may be needed for a clear picture, 
making the vector representation 1,000,000-tuples, which might produce a memory 
storage problem. In the development of the idea of a potential function for 
construction of a separating function any ofithonormal system of functions &' ‘ , (X)»s 



wi 1 1 produce convergence of the algorithm. It seems obvious that for some 
choices of the system convergence might be more rapid than for others. However, 
nowhere was there mention of how this choice might be made to minimize Hu In 
addition ijs (X) can be realized as a finite linear combination of the ^ (x)’s 

where the number N of the j- t (X)'s necessary depended on the sets involved. 

There was no discussion of the problem of how determination of an appropriate 
N, let alone a minimal one, could be made. 

This method does, however, have the advantage that convergence in probability 
is assured in a finite number of steps to any desired degree of reliability. The 
experiments made and reported bear out this result by the high percentage of 
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Introduction 


The purpose of this paper is twofold: 

(1) Introduce the concept of fuzzy set 
(Zadeh [1]) 

(2) Apply the concept of fuzzy set to pattern 
recognition (Wee [2]) 

We will consider only the ideas from fuzzy set theory that are 
directly related to pattern recognition. Our approach to pattern 
recognition will follow the PhD thesis of W. G. Wee. In this thesis 
an iterative procedure for learning the equi-membership surfaces and 
for generating a set of discriminate functions for two pattern classes 
is given. 




Fuzzy Sets 

The concept of fuzzy sets was first introduced by Zadeh [1], 

Since we will be interested in fuzzy sets only with respect to pattern 

recognition, we will define our concepts in Q = E n . 

Definition 1: A fuzzy set A in fi is characterized by a membership 

function [0,1] with the value of f at x representing the 

"grade of membership" of x in A. 

As an example of a fuzzy set in e\ let A be the set of all numbers 

"much larger" than 14. One can give a precise characterization of A by 

specifying f A (x) on E 1 (eg. f A (-l) = 0, f A (1000) = .2 , 
f A (10 ) = .5 , etc.). It should be noted that this characterization 
is subjective. 

Definition 2: The union of two fuzzy sets A and B is a fuzzy set C, 

written C = A u B, whose membership function is given by 

f c (x) = Max[f A (x) ,fg(x) ] 

for x £ . 

Definition 3: The intersection of two fuzzy sets A and B is a fuzzy 

set C, written C = A n B, whose membership function is given by 

f c (x) = Min[f A (x) ,fg(x) ] 


for x e . 



Definition 4: A fuzzy set A is convex if and only if the sets 

defined by 

T a = ^ x l f A ^ x ) 1 

are convex for all a e (0,1]. 


T 


a 


Definition 5: A fuzzy set A is bounded if and only if the sets 

T a = ^ x l f A ( x ) 1 

are bounded for all a > 0. 


Definition 6: 
defined 


The maximal grade of a fuzzy set A, 


written M. is 
A 


M A = SU P n f A (x) 
x e n 


Theorem 1: Let A be a bounded fuzzy set. Then there is at least one 

point Xq e at which is essentially attained in the sense that, 

for each e > 0, every spherical neighborhood of x Q contains points in 
Q(e) = (x|f A (x) > M a - e} . 

Definition 7: The core of a bounded fuzzy set A, written C(A), is the 

set of all points in at which M A is essentially attained. 

Definition 8: Let A and B be two bounded fuzzy sets and H a 

hyperplane. Let e IPs such that f A (x) £ 


on one side of H 



and 


and ° n t * ie ot ^ er side of H. Set = inf 

= 1 “ Mg • is called the degree of separation of A and B 

by H. The degree of separation of A and B, denoted D, is defined 
as D = 1 - M where M = inf^ . 

Theorem 2 (Separation Theorem) : Let A and B be bounded convex 

fuzzy sets. Set C = A n B. Then D = 1 - M (where M_ is the 

v C 

maximal grade of C) . 


Note that Theorem 2 says that the highest degree of separation of two 
bounded convex fuzzy sets A and B that can be obtained with a 
hyperplane is 1 - M . 

The above definitions and theorems are contained in Zadeh's paper; they 
do not exhaust all of the material contained there. Wee introduces the 
following definitions. 

Definition 9: A fuzzy pattern class is a pattern class which is 

a fuzzy set. 

Definition 10: A semi-fuzzy set is a fuzzy set A such that 

M a = sup f A (x) = Max f (x) = 1 . 
x 


X 
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Definition 11: Let A be a fuzzy set. The non-fuzzy section of A is 

defined by NFS = (x|f^(x) = 1} and the complete-fuzzy section of A is 
defined by COM = {x|f A (x) < 1} . 

Definition 12: A equi-membership surface of a fuzzy set is a separating 

surface such that points on the surface have equal grade of membership. 


Recognition of Two Fuzzy Sets 

The discussion that follows deals with the situation in which 
there are two bounded and convex fuzzy pattern classes, A and B, to 
be recognized. 

Suppose we have a set X of training samples. Let a e [0,1] and 

define 

l a = U|f A (x) >. a and f B (x) < a} 

and 

Lg = {x|f B (x) >_ a and f A (x) < ot} 

We further assume that a can be selected so that X £ L u L £ = E n . 

A B X 

Note that the separation theorem tells us that the lowest value of a that 

can be selected is M . In practice we seldom know M. „ . 

Aim AnB 

Wee's algorithm is an iterative procedure for searching for 
equi-membership surfaces until the complete set of training samples is 


contained within these surfaces. 



The first step separates the non-fuzzy section and the complete-fuzzy 
section of the training samples for A(B). [Note that this step may not 
be necessary] Separating boundaries are then generated to retain the 
complete-fuzzy section of A(B) . The retained training samples are then 
mapped into £2^ = E . Separation of the non-fuzzy and complete-fuzzy 
sections of "A"(B) in £2 (as in £2 ) is then determined. The complete 

y 2c 

fuzzy sections of A and B are retained and are mapped into £2 = E n . 

this procedure continues until £2^ is partitioned into two regions. 

The algorithm converges in a finite number of steps. The algorithm 
generates a set of discriminate functions which partitions £2 into 
two regions; generalization to any other point in £2^ is based on these 
discriminate functions. The evaluation of this generalization must be 
based upon experience. 

Figure 1 gives a block diagram of the algorithm. 



Figure 1: Block Diagram of Algorithm 







The training samples X are the input for Transformation Unit (TU) I 
which is a polynomial transformation in many cases. The output of TU I 
is a set Y _£ which is sent (usually) to the general adaptive element 
(GED) . First the GED uses the generalized inverse algorithm (Ko and 
Kashyap's algorithm [3]) to test the linear separability of the samples 
and to find the separating hyperplane. If the samples are not linearly 
separable Widrow and Jioff's algorithm [4] is used to generate a 
minimum mean sequence error hyperplane H: Z T W + W Q = 0 . Note that 

the distance from a point X ± to H is d ± = A’^W + W Q . From the 

samples "close" to II and those erroneously classified, the minimum 
and maximum distances from H are searched in order to obtain two 
parallel separating hyperplanes and . They are as follows: 

H^: X W + Wq - |w|d(max) = 0 
A^W + Wq - |w|d(min) = 0 

The following decision rules are now implimented: 

(1) P £ A if PW+Wq_> |w|d(max) 

(2) P £ B if P W + Wq <_ |w|d(min) 

(3) If P is such that |w|d(min) < P T W + W Q < |w|d(max) , 

send P to TU II. Let Y' represent the set of P's that were not 
classified. Let £ Y'. Then TU II transforms Y^ £ Y ? into 

l £ W z = E * rj:vo of t1ie tyP 63 of transformations used are as follows: 
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(1) Y. . -*• Z. . = a 
1 3 


iffi + W Q | 

Iwl 


|yTw + wJ 

(2) Y -*• Z = exp {-a — 2_} 

1J 1J Iwl 


The set Z of Z^'s is then sent to the GED and the process continues. 
(We remark again that the process terminates after a finite number of 
transformations . ) 
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ABSTRACT 


The purpose of this paper is to discuss the properties of a linear 
discriminant function for the case of arbitrary distributions with 
equal covariance matrices. Using two examples, a comparison is made 
showing how the difference of the means relates to the covariance 


matr i ces 



In the solution of recognition problems the linear discriminant 
function LDF of the form 

d(x) = x'L (jJj - p 2 ) - ^(jJ t + >J 2 )'21 ' (>J| - ,u 2 ) 
finds wide application, where the vectors in the n-dimensi ona I space £1 
of the recognized object, the mean values, and the general covariance 
matrix of the distributions in question are denoted by x, Ui, U 2 and *2. 
respectively. The method of application of the LDF consists in deter- 
mining the membership of the object x in the first class if xfeRj = 

{x | d(x)^0^ and in the second class if xeft- R|. 

The problem is to carry out the discrimination process efficiently 
in the case of incompletely known distributions, for identical covariance 
matrix j ■ ^ = T_ , since in practice the test of norma I i ty of multi- 

dimensional distribution is rarely made. I f c< = (p| •• 

the interclass divergence, then the bound oh p(K), the probability of 
mi sc lassi f i cation , is given by 

(•) p(<X) - u 2 )'2’ 1 (u ( - u g ) + l] “ l 

for the upper limit and 0 for the lower limit. (For the proof we refer 
the reader to 1.6j ). 

If the recognized object x comes from the one dimensional space, 
then the relation between the distance between u ( and u^ and p(o() 
can be easily computed. In order to obtain p(o()<t tor some 6 >0 

(2) \u, - u 2 |22|X| i ( I -t/tf 

Thus in order to compare two different problems with given covariance 
matrices, consider the following numerical example. 


2 


Example I. 

Let ^ ( = (h)t *]> = (i:) and 6 = l/lO. From the equation (2) 

we obtain |uj - u^]"^.^ for but|V ^ -y^>3 for Z ^ ' n order 
to have the maximum probability of mi6classi f ioation less than or 
equal to € 3 l/lO. Note that in each case the irrirer-class divergence 
is 36. The Figure-1 describes this example graphically. 








Figure-1 




In the case of multidimensional space, from the equation (l) 


our scheme in comparison of two problems with given covariance 
matrices is quite obvious, let o( = (u^ - u^) 1 ^ (uj - u^)» then 
this equation gives the ellipsoid in the principle axes plane with 
the length of the ith principal axis 2 J\.o< , where Xi , A*. ( - , \ n. 
are the eigenvalues of ^ . Hence as long as the difference of the 
two means yj| and lies on this ellipsoid, the interclass divergence 
will be constant and so the upper limit on the maximum probability 
of m i sc lassi f i cat i on remains constant also. It is clear that the 
shape of the ellipsoid depends of the covariance matrix. The 
dependence of the function p on the magnitude of the degree of 
divergence of the classes o( is shown in Figure-2, The curve denoted 
bv p shows the relation in the case of normal distributions. 



Figure-2 



Eva luation 


For arbitrary interclass divergence^ the maximum probability 
of misclassif ication of any classes using LDF with unknown Uj, u^ 
and !L is greater than the corresponding probability calculated for 

multidimensional normal distributions with the same u . u 0 and 51 . 

I 2 

However, the maximum value of the probability of misclassf ication 
is a decreasing function of (X and tends to 0 as o(->w , The lower 
limit of the probability of m i scl af i cat! on for arbitrary c< is equal 
to 0, which signifies that cases may be encountered even for small 
where the LDF constructed will classify without error. Forc*>i 4 . 
the probabulity of misclassf f ication is always less than i.e., 
in these cases classification bu means of LDF will always be better 
than random classification with equal probabilities of assigning 
the objects to the two classes. For Oi c the maximum probability 
of mi sc lassi f icat i on for the two classes is greater than g-, which means 
the operation of the LDF may be poorer than random classification. 



a- s 
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In this paper we discuss linear programming and linear programming like 
techniques as applied to pattern recognition problems. Our method will 
be to summarize three relatively recent research articles on such appli- 
cations. In particular, we summarize the main results of each paper, 
indicating the theoretical tools needed to obtain them, and we include a 
synopsis of the author's comments with regard to the applicability or 
non-applicability of his methods to particular problems, including compu- 
tational results wherever given. For more detailed information on the 
methods mentioned here or other such techniques, the reader is referred 
to the particular research article of interest. 

The basic problem considered in all three papers is the following: 

Given two sets of patterns A and B (we consider each pattern as a point 
in E n - Euclidean n-space) , does there exist a surface in E n which 
separates A dnd B? That is, does there exist a surface in E n such 
that all the points of A lie on one side of the surface and all the points 
of B lie on the other side? A special, but much studied, case of the 
above question is: Does there exist a plane (hyperplane) in E n which 

separates A and B? 

The paper is appropriately divided into three sections, one for each 


article. 
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1. Linear and nonlinear separation of patterns by linear programming. 

Let A and B be two sets of patterns, the set A consisting of 
m patterns, the set B consisting of k patterns, where each pattern 
consists of n scalar observations. Assuming that each pattern represents 
a point in E n , we wish to determine a surface in E n that separates 
A and B . 

The author of this article, 0. L. Mangasarian, considers two methods 
of attempting to separate A and B and states that a generalization of 
his second method can be made. In particular, Mangasarian attempts to 
separate A and B by: 

(1) linear separation (by a plane); and 

(2) a quadratic surface. 

We now give a summary of the theoretical details and development of the 
algorithm. 

A pattern will be a row vector (x^ , . . . in E n , each entry x_^ 

called an observation. We represent a set A containing m patterns as 
an m x n matrix, each row of which represents a pattern in A. . Using 
this notation, our problem is to determine a surface in E n such that if 
the m rows of the matrix A and the k rows of the matrix B are 
considered as points in E n , then they fall on opposite sides of the 
surface. Mangasarian states and derives his results for the linear sepa- 
rability case and states two of the corresponding results for the quadratic 
case. We follow his lead and restrict ourselves to the linear case. 

Thus, we wish to determine a single plane 
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xd - y = 0 


CD 


where d is an n-dimensional column vector of real numbers, and Y is 
a scalar (real number) such that 


Ad 

o 

A 

>- 

0 ) 

1 

(2) 

Bd 

o 

V 

>- 

1 

(3) 


where e ancf 5. are respectively m- and k-dimensional column vectors 
of ones. 

We now make the following definition. 


Definition . Two sets of patterns A and B are linearly separable if 
and only if there exists some d,Y such that (2) and (3) are true. If 
no such d,Y exist, then A and B are said to be linearly inseparable . 

Lpmma 1 . A and B are linearly separable if and only if there 
exists an n-dimensional vector c of constants and real numbers a and 
3 such that 


> 

0 

1 

a 

IV 

o 

(4) 

-Be +^3^0 

(5) 

a - 3 > 0 

(6) 

f > c s -f 

(7) 


where f is an n-dimensional column vector of ones. 

Now, if a - 3 is considered as the objective function of the linear 
programming problem with constraints (4), (5), and (7), we have the fol- 
lowing theorem. 
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Theorem 1 . Necessary and sufficient conditions for linear separability 
of A and B is that 0(A,B) > 0 where 0(A,B) is the solution of the 
linear programming problem 

0(A,B) = max „{a - @| subject to the 

c jQt , p 

constraints (4), (5), and (7)}. 

Corollary 1 . Necessary and sufficient conditions for linear insepa- 
rability of A and B is that 0(A,B) = 0. 

(It should be remarked that the author suggests two possible approaches 
in case A and B are linearly inseparable. 

(i) A technique of eliminating points of A or points of B so 
that those points remaining are linearly separable. 

(ii) A technique which uses a finite number of planes to separate 
A and B . ) 

Mangasarian then invokes the duality principle of linear programming 
[8; p. 71-74] to obtain the analogues of theorem 1 and corollary 1. He 
uses the latter analogue to obtain the following condition, which is simi- 
lar to a condition of Highleyman [12] and Nilsson [22], It is an immediate 
way of determining linear inseparability, according to Mangasarian. 

Theorem 3 . (Dual Inseparability Criterion) . Necessary and sufficient 
conditions in order that the sets of patterns A and B be linearly in- 
separable is that the system 



A'u - B'v = 0 


e’u = 1 

JH'v = 1 

u > 0 
v > 0 

has a solution, where u and v are m- and k-dimensional column vectors 
and the prime denotes transpose, (e and J? are as defined previously.) 

Although the author does not present any computational results for 
his method, he does make comments regarding its usefulness. He says that 
the most widely used method for nonparametric pattern separation is 
Rosenblatt's error correction procedure [26], [27] for linear separation 
or a modification of it. [10]j[21]. This method is based on a very simple 
iterative procedure. One advantage of this method over his is its 
simplicity. Its main disadvantage seems to be its inability to determine 
inseparability of pattern sets when it occurs. This is a consequence of 
the fact that the error correction procedure converges only when the pattern 
sets are separable, a fact which is not known a priori. Since it is possible 
to construct some simple examples for which the error correction procedure 
converges very slowly, the problem of distinquishing between slow convergence 
and nonconvergence may be a difficult one. Another advantage of his tech- 
nique, Mangasarian says, is that it can readily be extended to separate 
two sets by more than one plane or surface. 



2. Pattern separation by convex programming. 

The basic problem considered in this paper by J. B. Rosen is the same 
as that of section 1. However, the approach to the problem is different 
and perhaps more complicated. Computational results are included; some- 
thing lacking in Mangasarian ' s paper. 

We summarize the techniques presented in the paper. Suppose that 

are sets of patterns (point sets) in E n . We wish to partition 
E n into regions such that each region contains at most one of the A^. 

The author considers two techniques. 

(i) Given two pattern sets A^ and A^ , the author shows that in 
order that A^ and A^ be linearly separable it is necessary and suffi- 
cient that a certain convex quadratic programming problem be solvable. 
Moreover, if A^ and A ^ are linearly separable, then the author deter- 
mines the distance between A^ and A^ and constructs the unique hyper- 
plane which determines this distance. Extensions to k pattern sets are 
given. 

(ii) The second technique or problem which the author considers is 
that of enclosing one pattern set in a "minimum" ellipsoid. Rosen defines 
what he means by "minimum" and shows that such an ellipsoid is unique. 

In the last section of his paper, Rosen gives computational results 
achieved on certain problems. 


The theoretical details of Rosen's paper are somewhat more complicated 


than that of section 1. We summarize these details here, again omitting 



proofs as in section 1. 


For linear separability, the ideas are similar to those of Mangasarian, 
except that Rosen uses convex programming rather than linear programming 
to determine linear separability. 

By a convex programming problem , Rosen means the minimization of a 
convex function subject to linear constraints. Given two point sets 
P and we say P^ and P^ are linearly separable if and only if 

there exists a hyperplane (plane in the terminology of section! 1) 

H = H(z,a) = {p e E n |p'z = a} such that P^ and P^ lie on opposite 
sides of H, where z is an n-dimensional column vector in E n , a is 
a real number, and ' denotes transpose. Through a series of substitutions 
and generation of equivalent problems, the author proves the following 
theorem. 


Theorem . P^ and P^ are linearly separable if and only if the con- 
vex quadratic programming problem 

a = min {1/4- 2" =1 y^|Q-[y ^ e x ; -Q^y S e 2 ? 


has a solution. If P-. and P„ are linearly separable, then the dis- 

X 

tance 6 between them is 6 = l/v^T and a unique vector y_ = ( ) 

° B 0 

achieves the minimum O. The separating hyperplane is given by H(Xq,3q)= 
{p e E n | p'x 0 = g Q } 


Although it is somewhat detailed, 
order. 

Let P^ be a point set; that is, 
each pattern as being a point in 


an explanation of the notation is in 

a set of patterns. We think of 
E n , where 
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p ij 



a 

2j 


a 

nj 


Suppose that E^ has m^ elements and write as the matrix whose 

j tl1 column is P]^ j • Thus P^ is an n x m^ matrix. Similarly for P 2 , 
another point set. The distance 6 between P^ and P^ is Euclidean 
distance; Rosen claims that this distance will be the maximum value of y 
(real number) for which a hyperplane H(z,a) exists such that 


Pp > (a + 1/2 y) e x 

P^z < (a - l/2y) e 2 

|| z || = 1 (Euclidean norm) 


where ' denotes transpose and e^ and e 2 are m^- and m 2 ~dimensional 
column vectors of ones. 

Letting z = x/||x||;a = 8/|)xj|; y = 2/||x|| , and arriving at an equivalent 
problem to his original one, the author makes the following definitions: 


y = 


( X ) 

V 


( _p for each j = 1,. 


’ m l 


2j 


2i 

( _^) for each j = 1 , , 


> m 2 * 



where is the number of elements p ^ in P^. (Note that y,q^ 

and are (n+ 1 ) -dimensional vectors.) 

Finally, define and to be the (n+l)xm^ and (n+ljxn^ 

matrices (respectively), whose j *"^ 1 columns are q^ and q2^ (respec- 
tively). Thus, we have the notation of the theorem. 

Rosen then shows that if P^ and P2 are "linearly separable, then 
basic subsets ^l — » ^2 — ^2 can c ^ osen suc h that: (i)P^ and P2 

determine the the same separating hyperplane as P^ and P2; (ii) the 
distance between P^ and P2 is the same as the distance between P^ 


and P2; and (iii) P^ and P2 have the property that removing one or 
more points from either P or P 2 results in an increase in the dis- 

tance between them. The author then generalizes his results to the 
case of k pattern sets, k a positive integer. 

For the ellipsoidal separation (nonlinear separation) , Rosen wishes 
to enclose a pattern set in a unique ellipsoid of "minimum" size. He 
achieves this by minimizing the sum of the squares of the ellipsoid's 
semi-axes. This is shown to be equivalent to the problem of minimizing 
the trace of a certain set of matrices. The author proves that such an 
ellipsoid is unique. Rosen then describes an iterative technique of 
determining this "minimal" ellipsoid. The procedure is to alternatively 
solve two convex programming problems, each of which involves the min imi - 
zation of quadratic forms. Finally, Rosen shows that this procedure 
converges to the unique solution. 

The author is quite detailed with regard to computational results 
of his techniques and in suggestions for overcoming computational problems. 



We will not detail these here. Computational techniques and the corre- 
sponding computer programs have been developed for each of the two 
methods presented by Rosen ([9], [25], [6], [23]), and computational 
results for particular problems are given. (see [6], [23]). Computer 
times seem quite good, although the size of the problems Rosen considers 
in his computational work may account for this. Finally, Rosen makes no 
comparison of his techniques with others. 

3 

3. Pattern classifier design by linear programming. 

This paper by F. W. Smith is probably the most detailed of the three 
papers reviewed, as far as examples and computational techniques and re- 
sults are concerned. Smith considers the same problem as that of the 
previous two sections. However, his work is almost exclusively for the 
linearly separable case; only brief mention is made that his techniques 
extend to the linearly inseparable case. 

Smith's approach to the problem differs from that of the previous 
two in that he attempts to determine the separating hyperplane subject 
to the minimization of the mean error function. [15], [16]. Two types 
of the fixed-increment adaptive method ; namely, the steepest descent 
design method [15], and the one-at -a -time design method [15], [17], 

[22] are considered. Both of these methods are iterative type techniques. 
The author formulates this approach (that is, minimizing the mean error 
function) as a linear programming problem and then compares this formu- 
lation with the two previously mentioned fixed-increment adaptive 
methods. Computational results, suggestions for handling special types of 
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problems; suggestions for overcoming computational difficulties, etc. abound 
in the paper . 

We briefly summarize the author’s, approach to the problem. Smith's 
formulation of the problem as a linear programming problem and his many 
comments and suggestions for special cases made in doing this are too 
detailed for the purposes of this report. 

Let A = {Yj- , . . . ,Y^} ; B = be two sets of patterns. As 

in sections 1 and 2, each and Z ^ is considered as a point of E n . 

We wish to find a We E n and a real number d such that 


Y^i H and -z£ W £ d Vk 


a> 


(Smith calls d a scale factor 117], which for the purposes of this paper 
was taken to be 1.) 

The mean error function , h, is defined by 

h “ Z k = 1 k\ + Z k = K+l k\ 
where h^ is the pattern error function associated with 


Y^, if k = 1 , . . . ,K 


and associated with 


Z k _ K , if k = K+l K+M, 


and is a weighting coefficient for each k. 

For the fixed-increment adapter method h^ is defined by: 
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h k = -Cx£w - d) if x£w < d 


= 0 


if x£w > d 


n 


where W is an n-dimensional column vector of E and 


X = Y. for i = 1,...,K 
i 1 

X R+1 = -Z 1 for i = 1,...,M. 


Note 

that if W e E n 

and if 

W is such that 

xjw.d 

for each k, then 


0 for each k. 

Thus , 

h = 0, and W 

satisfies 

Cl). 


Each of the two techniques with, which Smith compares his method 
are initiated by choosing an arbitrary Cbut Smith suggests it can be 
well chosen) W. One then proceeds by incrementing the initial W, 
subject to the criteria of minimizing h. The main content of Smith's 
paper is the detailing of the formulation as a linear programming pro- 
blem the problem of determining W subject to the criteria of minimizing 
h. 

The author's primary comments on computational results are comparisons 
of his linear programming technique with that of the steepest descent and 
one-at-a-time design methods. He is quite detailed on this, giving: 
conjectures for when one method is better than another; calculations for 
the computer time required for a given, but arbitrary problem; suggestions 
for methods of handling certain types of problems, as well as computational 
results with time and accuracy comparisons for the three techniques. 

The author also gives suggestions on how to eliminate some of the 
elements in the pattern sets in order to reduce computer time, but still 
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arrive at the same, or nearly the same, W as one gets using all the 
patterns . 

Finally, the author comments that he thinks his techniques should 
extend to the nonseparable case; however, all detailed computational 
results are for the linearly separable case. 


While it is not our purpose to judge the merits of these linear 
programming type approaches with regard to the pattern recognition pro- 
blems of MSC and NASA, some comments can be made. 

While a nonstatistical approach, to the pattern recognition problems 
of MSC and NASA is somewhat questionable, there may still be some partial 
utilization of such an approach.. 

An application of theorem 3 of section 1 might be useful for consider- 
ing pattern sets that one suspects to be linearly separable. Mangasarian 
claims this to be an immediate way of determining linear separability. 

The techniques suggested in section 2 have the advantage over those of 
section 1 in that commuter programs have already been developed for them. 
The idea of enclosing a pattern set in a minimal ellipsoid is applicable 
in the linear inseparable case and perhaps would have application in, 
at least, special problems. The approach suggested in section 3 is dif- 
ferent than those of sections 1 and 2, and appears to perhaps have more 
potential than the first two. Computer programs have also been developed 


for this technique. 
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INTRODUCTION 


A cluster seeking technique is a method o.f dividing 
data into subsets, called clusters. These clusters contain 
data points that are "similar" to each other and "different" 
from the elements of other clusters. The methods for 
determining the clusters differ in a variety of ways. 

Basically these methods all stem from the inadequacy 
of the most commonly used statistics (the overall mean, 
covariance, and correlation) when the distribution is 
non-Gaussian. It is relatively easy to construct data 
sets which, when plotted, appear quite different but whose 
covariance matrices, for example, are identical 
Moreover, the classes into which it is desired to sort 
data are usually those established by human perception, 
and it has been argued that the usual statistical descriptors 
have little perceptual significance 

Notation ! 

In the sequel, will denote the j-th data vector or 
pattern. N will be the total number of patterns. If the 
patterns are members of a finite dimensional vector space, 


* Bracketed references refer to entries in the 
bibliography. 


D will denote the dimension and X^(i) will denote the 
i-th component of as a member of Ep. will denote 

the similarity coefficient between the i-th and j-th 
patterns, and d. . will denote the "distance" (not necessarily 
a metric) between them. 

Since the measure of similarity is crucial to all 
the cluster seeking techniques, some of the various measures 
that have been used are summarized in Table 1 

Some of the algorithms may be applied with any of the measures, 
while others are more specific. 

The various cluster seeking techniques have been broken 
down into seven categories i 

1. Probabilistic 

2, Signal Detection 

3. Clustering 

4. Clumping 

5. Eigenvalue 

6. Minimal mode seeking 

7. Miscellaneous 

In the following sections of this report, each category 
will be described and one or more algorithms of that type 
will be presented. 



TABLE 1 


MEASURES OF SIMILARITY 


Dot Product! 
Similarity Ratio i 


Weighted Euclidean 
Distance! 

Unweighted Euclidean 
Distance i 



Distance! 


Component Correlation! 


Normalized Correlation! 
Coefficient of Correlation! 


S ij 

R ij 

S ij 

d ij 

d ij 


= X i, X' ] 

= X i# X^ 

= R i,/(Rii+R. .-R ii ) 


-log S i . 


£ w,(X i (k)-X^(k) ) 2 

k=l K 


d,. = | (X i (k)-X’^ (k) ) 2 
k=l 

d ii = £ lx i (k)-X j (k)l 

k=l 

S ii m J t &1&- | X i (k)-X*^ (kjQ 

6-\x x (l)-X^(l)l] • 
Cl-2\x i (l)-X j (l)l^ 
where r^ is correlation 
coefficient between components k & 1. 

S L . = X i *X^/ (X i ‘X i )(X« 5 *X' j ) 

s ij = ’J 1 (xi(k) - u k )(x 5 (k )- u k) 

lcfe 1 ( xi <k)-u k )' i 1 ( .i 1 (xj(k )- Ulc ) 2 

where u^ is the overall mean of 
the k-th component. 
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TABLE 1 (Continued) 

Coefficients of Association* For binary data, n will denote 
number of, a capital subscript denotes and a small 
subscript denotes *0'. 

U n JK /(n JK +n Jk +n jK ) 

2 - <n JK +n jk ,/D 
3. n^/D 

**' 2n JK /,(2n JK +n Jk' t ' n jK ) 

5. 2 ( n JK +nj k )/( 2 (n JK +n^ k ) +n Jk +n ) 

6 - n JK /(n JK + 2 (n Jk +n jK ,) 

7. < n J K +n jk^ (n JK +n jk +2(n Jk +n jK )) 

8 - n J K / (n J +n K- 2n JK ) 

9 - (n JK +n jk )/(n Jk +n jK ) 

10. i( (njK/njJ+fnjK/nK^'Ojl/ 11 : ) 'f'jl/ n k ) ! 

11* ?( ( n JK / n j )+( n JK / n k ) ) 

12 0 n JK^ n J n K 

13 • n JK n jkA^ 1 J n K n j n k 

14. (njK^jk^Jk” 11 ^^ 0 

1 5 . ( n JR n j k -n JR n )/ ( n ^n ^ k +n Jk n j K ) 

l6 - <n JK"jk- n jK n Jk ) / (n J n K n j n k )i 



PROBABILISTIC 


Probabilistic cluster seeking techniques are primarily 
analytical studies. The probability of occurance of a pattern 
is estimated and then a weighted combination of patterns 
is used to estimate probability distributions. 

The following algorithm developed by Fralick is typical £22^ 
Suppose there are M possible classes w^ f ...*w M * and 
associated with each is a conditional probability density 
p(X/w^) which is known except for a single parameter 
that is, assume p(X/w^,©^) is known. Assume also that the 
a priori probabilities of occurence p(w^) are known, that 
the a priori distribution of 0*, P 0 (©*) is known, and that 
can assume only a finite number of values. Then the 
desired density can be determined as follows t 

paw =5 p( w»i 

where 

i} * p k-l ( ® i)# 



,« i )p k (© i ) d© 1 

ptW el >p (w i> + jfiPk-i ( V >, j ) p (,, j ) 

£p k _ 1 <V"j )pl ’ , 3 ) 


For the case of an unknown signal in noise, he proves 
that P k (X k+1 /w i )->p(X/w i ), However, the amount of computation 
and storage required is considerable, particularly for multi- 
variate problems. Moreover, in the case where the class 
a priori probabilities are all the same, the initial selection 


of the probability distributions for the various classes 
must be different for "learning" to occur fa 


& * 



Other probabilistic techniques are discussed in 


_ZT-« 

(17.45.1 6 } 



SIGNAL DETECTION TECHNIQUES 


Signal detection techniques grew out of a desire to 
detect unknown signals in noise. The final decision is 
based on correlation detection to estimate parameters of 
a matched filter. 

The following algorithm of Jakowatz is typical [29^1 

A sample waveform M is stored in the memory of a 
correlation detection device. When the dot product 
b(t) = M(t)»X(t) of the incoming wave X exceeds a threshold 
b T (t), the waveform in memory is modified as follows! 

Let t^ be the time at which memory is changed. Then 
M(t j =(gM(t^ - 1 )e“ <is +X(t i ))/(g+l) where g depends on a 
capacitor ratio, d is a time constant associated with the 
memory device, and s * t-t^_^ for t^t 7 t.^. 

The threshold grows with successful detection and decays 
with failure to detect. 

Other signal detection techniques may be found in 
All of these are primarily used for signal 
detection and as presently conceived their utility outside 
this area seems limited. One severe problem is the use 
of energy detection to start the process going. There is 
a definite thresholding effect for weak signals, and 
apparently a minimal adaptable signal, which may be 
a function of signal waveform. 
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CLUSTERING TECHNIQUES 

Clustering techniques can be characterized by sorting 
of patterns using multiple cluster points* Tentative 
assignments are made to clusters and these assignments 
improved until the centroid of the cluster adequately 
describes the data* Since these techniques vary in a 
number of ways, several algorithms will be presented here. 

Okajima proposed the following algorithm for use with 
electrocardiagram data (43\i 

Step Number Step Description 

1 The data vectors are arranged in random 

order and a bank of memory filters is 
initialized to zero* 

2 The incoming data vector X is selected 

and weighted (if desired). 

3 The correlation X*M/((X*X) (M*M) ) with 

each used memory filter is computed and 

a memory filter M is selected which gives 
the maximum correlation, 

4 If this maximum correlation exceeds a 

predetermined threshold, the filter is 
modified by the rulei If X t is the i-th 
pattern entering the same filter M, then 
M = (l/i)(X 1 +X 2 +...+X i ). 
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5 If not* the data vector goes into a new 

filter. 

6 Repeat 1-5 until all data has been examined. 

The algorithm depends on the threshold, the weighting, 
and the order in which the pattern vectors are selected. 
Algorithms very similar to this have recently been proposed 
using different measures of similarity [^ 0 ,^ 8 , These 
"one-pass" techniques are definitely time-savers • 

Sebestyen is concerned with computing a probability 
distribution based on the sample data [^ 9 , 50 ]. A pattern 
is selected and compared with existing cluster centers. 

The measure of similarity is a weighted Euclidean distance 
with the weight depending on both the component and the 
cluster. The minimum distance of the pattern from a 
cluster point is compared with two thresholds. If the 
smaller threshold is not exceeded the pattern is added 
to that cluster and a new mean for the cluster is computed. 

If the larger threshold is exceeded, a new cluster is formed 
using that pattern as its centroid. If the pattern distance 
is between the two thresholds, the pattern is temporarily 
rejected and will be considered later on in the process. 

This algorithm is computationally complex, and very sensitive 
to the weight factors. 

The ISODATA program of Ball and Hall [ 5 , 6 , 7 ^ has 
recently undergone comprehensive study /27,31 ,32,33T. 



The version presented here is the "final" version 
recommended in 03^ • 


The ISODATA Algorithm 
0. Initialize 

Loop 1, CLASSify and calculate STATistics 
2. Change cluster structure* 

2.1 DELETE 


2.2 If iteration is a split (S) iteration, SPLIT, and 
go to Step 3 I otherwise continue. 

2.3 COMBINE 


3. If iteration is the final one in the SC sequence, STOP* 
otherwise, go to Loop for the next iteration. 

Before the subroutines mentioned above can be explained, 
some notation must be developed* 

SGMAX Maximum standard deviation allowed in a 

cluster, larger than which the cluster is 
split. 

DLIM Minimum distance between two clusters, less 

than which they are combined. 

NCLUSTR Number of clusters at any particular iteration. 

NDATA(I) Number of data points in the i-th cluster at 

any particular iteration. 

NMIN Minimum number of points in a legitimate 

cluster. 

NTOTAL Total number of data points in the input. 




SC sequence Split (S) and combine (C) sequence, 

i i 

u j » Sj Mean and standard deviation of the i-th 

cluster along the j-th coordinate. 


Initialize i Input values for SGMAX, DLIM, NMIN, SC sequence, 
and a starting procedure. The default option 
sets SGMAX = 4.5, DLIM =3.2, and NMIN ■ 20. 

If no starting procedure is specified, the 
SC sequence ■ SSSCSCSCSCSCCC, NCLUSTR = 1, 

®*nd Uj = 0, j = 1 , • » • ,D. 

CLASS and STAT i From the previous iteration there are left 

NCLUSTR cluster centers. The subroutine 
reclassifies the data points to their 
respective closest reference points, using 
distance. The means and standard 
deviations of these new clusters are itera- 
tively accumulated at the same time the 
points are assigned, 

DELETE i This subroutine deletes the existence of a cluster 
when it contains less than a prespecified minimum 
number of points (NMIN), 

SPLIT » This subroutine splits a cluster along the j-th 
coordinate by creating two clusters with centers 
at (uj * ,U£ * , » • • ,Uj * + s , . , »U q^ if 
(i)lts standard deviation along the j-th coordinate 
is larger than SGMAX j and if (ii)lt has more than 



2 (NMIN+1 ) data points. 

COMBINE I This subroutine combines two clusters if the 
distance between themi 

d(u p ,u q ) = ? (1/s . p s . q )(u. p -u. q ) 2 

is less than DLIM. 

Although reasons are given in |3^ for the use three 
different distance measures in the same program (computational 
simplicity), the logic behind mixing x for distance from 
data to cluster, $ 2 for standard deviation of cluster, and 
a weighted $ 2 for distance between clusters, is difficult 
to follow. The user specified thresholds have a great 
influence onto clusters formed, although the iterative 
nature of the algorithm somewhat ameliorates this. 



CLUMPING 


In these techniques a single pair of patterns is selected 
as a nucleus for a clump of patterns. Other patterns are 
assigned to this clump on the basis of the similarity measure. 
Genrally speaking, these techniques require the calculation 
of all pairwise similarity coefficients, forming a similarity 
matrix, and some of these must be recalculated after each 
new combination. 

Several "clustering by linkage" techniques have been 
suggested jj9, 52,53^. All involve first calculating a 
similarity matrix. The nucleus of a cluster is established 
using those two patterns with the highest similarity 
coefficient. Then patterns are added to this nucleus one 
at a time. Single linkage calls for admitting a pattern 
if its similarity coefficient with any one member of the 
cluster exceeds a threshold. Iterative improvement is 
provided by recalculating the mean similarity both within 
groups and between groups. Complete linkage requires that 
a pattern joining a cluster must have a value above the 
threshold with all members of the cluster. If there is a 
choice, it should be made first to give the larger group, 
second to have fewest residual patterns, and third to give 
the highest average similarity coefficient,, After each 
iteration a new similarity matrix is calculated using the 
means of the clusters. Clustering by average linkage 



bases admission on the average similarity of that pattern to 
all members of the cluster. If an admission would lower this 
average similarity by more than .03 (an empirically determined 
value) the pattern should not be admitted. 

Rogers and Tanimoto use a function related to information 

theoretic entropy as a criterion for clustering binary 

patterns Their algorithm is as follows* 

Step Description 
• • 

Compute R. . * X le X^ 

J 

S ij “ R i/ (R U +R ji' R iJ ) 

«i * | (-lo« 2 S lj ) 

J ^ 

R^ « # of pattern vectors j such that 


Step Number 
1 


2 

3 

4 


5 


R i^°- 


Now rank all patterns , first in order of R^, 
and then, for those with equal R^, in order 


of n L . 


Let djj = ■*l 0 §2 S i5 and £° rra distance 


matrix M. 

Let E n (M) = $*(d i ./T n (M))(log 2 (d i;j /T n (M))) 

where T (M) * £( z. d. .) and > denotes 
n ij 

summation of the finite terms of M, after 
repeated rows and columns have been deleted. 

Let g be the number of zeros above the diagonal 
of M and h the humber of infinite terms 
above the diagonal which are not in the same 



row or column as one of the g zeros. Set 
F n (g,h) = log 2 ((n-g)(£)(n-g-l) - h) and 
U n (M) = 1 - (E n (M)/F n (g,h)). 

U n is a measure of heterogeneity. 

6 If U n (M) is near one, clusters do exist and 

the process proceeds by selecting X l0 , the 

highest ranked pattern, and X^ 0 , the second 
highest. 

7 Consider all patterns X J with d. . < d. , 

and determine U for this subset. If 

U is small, add X Jo to the clump and 
recompute U for the larger clump. Continue 
until U takes a large jump, indicating the 
end of a clump. Remove those cases nearest 
the edge and start a new clump. 

Bonner proposes two methods They are both of 

sufficient interest to be presented here. 

The first method involves computation of a similarity 
matrix. This matrix is then "thresholded" by comparing 
each entry with a predetermined constant (eg. .45), If 
the threshold is exceeded, a one is entered in the 
corresponding position in the new matrix. Otherwise a 
zero is entered. This new similarity matrix is then 
manipulated according to the following algorithm* 

CLUSTER I * The similarity matrix is now regarded as a set 
of binary patterns and its similarity matrix 



is formed using the following measure* 

If is the number of ones the i-th and j-th 
pattern have in common, then 

S ij * C ij / ^^ C ii +C o j“ C ij ^ * 

This new matrix is then thresholded. This process 
of taking the similarity matrix of the similarity 
matrix may be repeated as often as desired, hopefully 
until stabilization is reached, 

CLUSTER II * The input here may be the original matrix, or 
the result from CLUSTER I, First "tight” 
clusters are formed in which all members are 
similar and no nonmember is similar to all. Then 
using the tight clusters a set of "core" clusters 
is located in which no object is in more than 
one cluster and all objects in a cluster are 
similar. Finally, a cluster adjustment program 
attempts to build around the cores. 

Algorithm for tight clusters* This algorithm keeps track of 
three things at each level of buildup* 

1, The set of objects (A^ ) in the cluster to this point, 

2, The set of objects (C^) which could possibly be added 
to A^ to further increase the cluster, 

3, The number (L i ) of the last object ^ to be considered 
for addition to the cluster 0 

These three things are stored for each i which is smaller than 
or equal to the present i. Also needed is the similarity matrix 



»here 3^ => 3 \s li( .= l}. 

Step Number Step Description 

1 i = 1, = all objects, A^ * j6 t * 1, 

2 If )( L a^ C^, L^=L^+1 and go to Step 5. Other 

wise continue to step 3» 

3 C i+1 - C ± n S L . . A i+1 - A 1 

4 L i+1 “ L i +1 » 1 = i+1 

5 Is greater than the number of the last 

possible object? If so, go to Step 6, if 
not go to Step 2. 

6 If ® 0 t store as cluster. If not, 

A^ is a subset of a cluster already found 
and so need not be stored. In any event, 

T « A r 

7 i * i-1. If i = 0, STOP. Otherwise, go to 

Step 8 

8 ** ^X J \x i5 fc i and j>L^. Is C^T? 

If yes, go to 7. If not, go to 2. 

Algorithm for core clusters i Let i be the alternative index 

and j the buildup level. 

Step Number Step Description 

Find the tight cluster having the largest 
number of members and store it as the first 
core. Set j = 1. If there is a tie for the 
largest cluster, go to Step 9, 


1 



JZr 18 


i=l 

Find the tight cluster having the most 
members different from the total set of members 
in all stored "core" clusters of alternative 
i of buildup level j. Call this its 
difference set. Call the cluster itself 
a maximum distance cluster. 

If this difference set is larger than that 
of any of the other alternatives of buildup 
level j yet considered, drop these alternatives^ 
consider only the present alternative and 
go to Step 5 . If it is smaller, drop the 
present alternative and go to Step 6. If it 
is the same as that of other alternatives of 
buildup level j, consider all still as 
possible alternatives and go to 5. 

If there is only one maximum distance 
cluster, store its difference set as the 
next core cluster for alternative i and 
go to 6, if there is a tie, go to 8. 

Have all alternatives of buildup level j 
been considered? If so, go to 7 . If not 
i « i+1 and go to 3, 

For any given alternative, are all possible 
objects in one of the core clusters? If 
so, print out the core clusters for all 



alternatives and STOP, Otherwise, 
j * j+1 and go to 2. 

Of the set of clusters in the tie, pick 
the smallest and store its difference set 
as a core for alternative i and go to 6. 

If there is still a tie, go to 9, 

Form a dissimilarity matrix for the clusters 
in the tie, where two clusters are considered 
dissimilar if their difference sets 
contain no common member. Find all the 
tight clusters for this matrix. Each tight 
cluster here will represent a set of the 
original tight clusters whose difference 
sets are disjoint, Store the largest such 
set of difference sets as a set of core 
clusters. If there is a tie, all alternative 
will be followed in the hope that subse- 
quent choices of cores will favor some 
alternatives over others. They are therefore 
added to the alternative list of the next 
level of buildup. Note that it is possible 
that more than one core will be added to 
each alternative by Step 9. By convention, 
this addition is still treated as one level 
of buildup. Go to step 6, 
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Cluster Adjustment Program* Specify a criteria for judging 

a cluster as "large". 

Step Number Step Description 

1 i = 1 

2 j - 1 

3 Consider the j-th member of cluster ii 

Compute from the similarity matrix the 
number of objects in the first large 
cluster to which this j-th object is 
similar. Divide this by the number of 
objects in the first large cluster to 
produce a percentage match of the j-th 
object to the first large cluster. 

Compute such a percentage match of the 
j-th object with each of the large clusters 
and with each of the small clusters already 
considered. 

4 Are any of these matches above some 

threshold (eg. .8)? If yes, go to 5, if 
not go to 6. 

5 Delete the j-th object from the small cluster 

and put it into the cluster offering the 
best match. 

j = j+1. Have all members of cluster i 
been considered? If no, go to 3, if yes 
go to ?. 


6 



7 


i s i+l. Have all clusters been considered? 


8 


9 


If no, go to 2 . If yes, go to 8. 

Iterate this entire procedure as many times 
as desired with the hope that stability 
will be obtained. 

Compute for all remaining pairs of clusters 

C. and C., a measure of their interactioni 
J N. N. 

I 13 - (l/N.N.) i 1 £■> S 
3 x J a=l b=l aD 

where is the number of objects in the 

k-th cluster, S &b = 1 if object a in is 

similar to object b in C^, and S &h * 0 

otherwise. A measure of value for the 

i-th cluster is then 

N t 


'i = % - (1 / n r) r hi 


where N R is the number of clusters other 

than the i-th. For the whole set 

V = ( 1 /Np+l) ^R +1 V. . 

R i=l X 


Bonner admits that this procedure becomes difficult as 
the number of clusters becomes large and wh«n "ties" occur 
frequently in the core building subprogram. He presents 
the following rather ingenious alternative. He states 
that he has a program for this algorithm which can handle 
2000 objects of 360 binary variables each and which averages 
3 minutes of computer time. 

Consider a cluster of N k patterns. Define 



where u. * (1/N) J* X^(i) , 

1 f=l 

where j is the index set 

for the objects in the cluster, 

s. 2 = I (X j (i)-u. ) 2 /(N-l). 

1 j*l 1 

Using a 7^-distribution, calculate the probability P that 

G 2- G k . 

Step Number Step Description 

1 Pick an object to act as a cluster center. 

2 Find the similarity coefficient between 

this pattern and all others. All objects 
more similar than an arbitrary threshold T 
are considered to be in the crude cluster. 

3 Compute the centroid of this cluster. 

Compute the expected number of clusters 
rarer than this to be found in an 
uncorrelated population, as given by 

P. If this number exceeds a preset 
number K, go to 7. Otherwise, "hill-climbing 
will be done in 4. 

4 Find the similarity between the centroid and 

all other objects using the followingi 

Add up the weights ( ^ u i^k** u i) 2 /^ s i 2 / N k) 
of all attributes i where there is a bit 
match between an object and the centroid. 



°k = L << u i>k-“i> 2 /< s f/ N k> 


(u,) k = ( i/N. ) Z xHi) 

K j over i 



If this sum is greater than a certain 
percentage Y of G^, then the object is 
judged as similar to the centroid. All 
objects similar to this centroid are now 
members of the new cluster. 

5 Is this cluster the same as the last? If 

so go to 6, otherwise to 3* 

6 Store the stable clusters as final clusters. 

Delete each member of the cluster from 
consideration as a future cluster center. 

7 Have all allowable objects been used as 

cluster centers? If not, pick one and go 
to 2» if yes, STOP, 

Bonner used both these algorithms on some disease 
symptom data and got similar results. The same results, 
with one notable exception, were also found through a 
standard factor analysis. 

Ward describes an algorithm which repeatedly combines 

those patterns which maximally increase an "objective 

function” go, 61^ . This function is supposed to 

measure the remaining information when two sets are 

united into one (assuming maximal, information corresponds 

to singleton sets). For example, when the patterns are 

grouped into one, he suggests ESS = Jbcix^ -(l/n)( ^X^* ^X^). 

l i i 

It is necessary to know in advance the number of clusters 
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N c to be formed. Let p^^ be the smaller and 

the larger of the two numbers used to identify the 

subsets S(p ,i) and S(q. 1t i) at the i-th stage. Then 

i-1 x ~ x 

s (Pi -> l , i“l) * ^(p^^ijy s(q^_^,i) , and the associated 

objective function is Z (p^ .q^^ »i-l ) . 

Step Number Step Description 

1 k = N 

2 Z »k-l ) * initial value worse 

than all others, i » smallest active index. 

3 j * first active index >i. 

^ Compute Z(i,j,k-1) 

5 Is Z (i, j ,k-l )*> Z )? If yes 

go to 6, if no, go to 7 

^ ’^k-1 ^ ~ 2(i,j,k-l), Pjc-i* 3 1* 

q k-l = 5* 

7 Is j = last active index? If not, set j = 

next higher active index and go to 4. If 
yes, go to 8. 

8 Is i * next to last active index? If not, 

set i = next higher active index and go 
to 3. If yes, go to 9. 

9 Identify the union by p fc-1 and make 

q k-l * nac, t* ve * 

10 Is k * Fy-ftlf so, stop. If not, k * k-1 and 


go to 2. 



Fisher examines all possible partitions on the real 
line and selects that partition which minimizes the 
weighted square distance from the cluster center C 19j • 

For an ordered collection of patterns he proves two lemmas 
that allow him to reduce the number of partitions he must 
consider. He has a program for his algorithm for N £200 
and the number of clusters (assumed known) is less than 10. 
He remarks that even with these size restrictions there 
are still a number of sources of difficulty. 

Sawrey proposes the following for psychological 


data oo. 
Step Number 
1 
2 

2.1 


2.2 


2.3 


2.4 


3.1 


Step Description 
Form the distance matrix 
Select potential clusters* 

Decide on a similarity threshold (eg, 
^(s./2) 2 where s. is the standard 

J J 

deviation of the j-th component). 

Construct a chart of the N patterns, listing 
with each all others that are similar. 

Select as a nucleus any two or more, 
beginning with the largest number of 
similar patterns. 

When a pattern or one similar to it is 
selected, it is deleted from the chart. 
Select dissimilar clusters* 

Decide on dissimilarity threshold 


3 



J 

Construct distance matrix of selected 
patterns for nucleus group. 

Sum all columns, selection beginning with 
the largest. When selected, all patterns 
that are not dissimilar are removed. 
Continue until all are gone. 

Compare and add remaining patterns as 
follows i 

Find centroid of each group. 

Make a chart of all possible additions 
(those that are not dissimilar) 

Find distance between possible additions 
and nucleus. 

Set several thresholds* (l/4)^Sj 2 , (l/3)£s 

U/2)^ 2 , OA)a/. isf. 

Add those patterns closer than the first 
threshold, except that if a pattern could 
be added to more than one group, it should 
not be added to any. 

Recompute centroid, determine the new 
distances, and add those less than the 
second threshold. 

Continue until all thresholds have beer 
used. 



McQuitty has a somewhat more stringent definition of 
a cluster or "type" j^^Q. A type is a set such that every 
one of its members is more like the other members of the 
type than like any other nonmember. In order to locate 
these types, first the similarity matrix M must be 
calculated. The entries of the matrix are then listed in 
order, omitting the diagonal, from the largest to the 
smallest. 

Step Description 

Let T^Tg,... be the types found so far. 

Let , • • • be the categories "expanded" in 

finding the types T^.Tg*...* which have not 

qualified as types. 

1 2 

Let T ,T ,... denote the first, second, etc. 
times a category requalifies as a type. 

Let X and X be the two patterns corresponding 


Step Number 
1 


5 

6 


8 

9 


to the highest similarity score. 

Since S ab > s ay fS by for any other pattern X y , 

X a and X d form a dyadic type T^, 
c (i 

Let X and X be the pair corresponding to 
the second highest similarity coefficient. 

If either X c or X d is X a or X b , assign 


all to C 


1 * 


If not, then X c ,X d constitute T 2 . 

Let X and X be the pair for the next 
highest coefficient. 



10 


11 

12 

13 

14 

15 

16 

17 

not 

£* O' 


_Z'*> 

If X is any of the preceding patterns, 
assign it to the corresponding C^„ 

Rep&ft for X f . 

If X is in one category and Xf in another, 

then neither category can qualify as a type, 

so combine the two categories into one. 
e f 

If X or X is in a category, but the other 
is in neither, then assign both to the 
category in which the one is found. 

6 f* 

If both X and X are in the same category, 
leave them alone. 

If either 13 or 14 occured, the categories 
must be continued. 

If neither X e nor X^" are in the previous 

categories, start a new category C. 

J 

with them in it. 

Repeat for all ranked patterns in the 
order of their rank with all categories 
operative at the time the pattern is 
considered. 

McQuitty claims to prove his method works, but he does 
provide for ties in the ranking. 

Other clumping algorithms may be found in 
44,41,1,2,36,12,35,283. 



EIGENVALUE 


Eigenvalue techniques! unlike the other techniques! are 
noniterative. They depend on calculation of a matrix 
associated with the pattern and determination of one or 
more of its eigenvalues and corresponding eigenvectors. The 
early effort^ in this direction involve estimation of the 
covariance matrix followed by its diagonalization and 
factor analytic techniques (j3, 56* 57 *58^. Since a large 
number of samples are required, especially as the number 
of dimensions increases, the computational aspects are 
formidable. 

Nunnally is in some sense intermediate between the 
clumping techniques and the eigenvalue techniques /~423. 

He constructs a distance matrix rather than the classical 
covariance matrix, but he does use the eigenvectors of 
the matrix to define the clusters. All patterns are 
examined with respect to the eigenvector basis and those 
with which many patterns are highly correlated are selected. 

Cooper /l3,l4,15"\ and Mattson Q?"! both find clusters 
by finding the maximum eigenvalue of the covariance matrix 
and splitting patterns on the basis of correlation with the 
corresponding eigenvector. Both papers are essentially 
limited to the two category case. 

Cooper is more analytic in that he proves, for specific 
distributions, that the hyperplane determined by the sample 



mean and principal eigenvector of the covariance matrix does 

define the optimal partition. However, he does depend 

heavily on a number of assumptions regarding the nature 

of the data. The cases he treats are those in which the 

two cluster distributions arei (1 ) univariate normal with 

the same standard deviationi (2 ) spherically symmetric 

multivariate normal with equal covariance { 3 )multivariate 

normal, either with diagonal covariance matrices or with 

one mean known. He mentions that the analysis for the K 

category case is very complicated, Herehjs: only result is 

that for K spherically symmetric distributions differing only 

in location, the number K can be determined from the 

multiplicity of the smallest eigenvalue of the overall 

covariance matrix. This is interesting in that much 

earlier Young (jS 2^ proposed the dispersion of the 

eigenvalues of the covariance matrix as an "index of 

clustering", and gave a method for determining the number 

of clusters based on this dispersion. 

In a sense, Mattson took Cooper's idea a step further* 

Making no assumptions regarding the underlying distributions, 

he suggests the following procedure* Find A = (a. .) 

N 1 J 

where a^ = ^^(X^(i)-u^)(X^(;j)-Uj), u^ being the corresponding 

component of the mean. Find the largest eignevalue of A and 

the corresponding normalized eigenvector, w. Then use 

J) 

s = X(k)w(k) and a threshold T. If S^T, X is in case 



It if SZ.T, X is in case O a For more than two categories* 
he suggests constructing a "network” of these linear 
threshold elements and using them to produce a binary code 
word for each class. 

For those not mathematically minded, the relaxation 
of assumptions "make the Mattson technique particularly 
useful" as an "excellent example of combination of analytical 
and intuitive approaches" (jj. However, for those 
concerned with rigor, it is unavoidable to wonder at 
the logic of applying the method when the covariance matrix 
is not an adequate reflection of the data (a point which 
Nunnally also raises). 



MINIMAL MODE SEEKING 


Th#g* techniques require categorization information to 
work. A new mode is created only when patterns in one class 
are nearer to a mode of a different class. Pattern density , 
as such, is not used in cluster seeking. 

Firschein [is\ partitions classes into subclasses so thk t 
each member of a particular class is closer, in the sense 
of high dot product, to the centroid of its own subclass than 
to the centroid of any other subclass Unlike previous 
procedures, this method does not require the specification 
of an arbitrary fixed distance as a criterion for membership 
in a subclass, nor is it necessary to specify the required 
number of subclasses beforehand. 

The algorithm begins by setting subclass equal to class. 
The centroid of each subclass is computed. Each vector 
in the first subclass is dotted with every centroid to form 
a dot product arrayi ( a jj) = X 1# u^ = /^components agree-#compo- 
nents disagree. Considering each row in the array, determine 
if the corresponding pattern 

Class I i Has highest dot product with centroid of its 
own subclass. 

Class Hi Has highest dot product with centroid of 
another subclass in the same class. 

Class III* Has highest dot product with centroid of 
another subclass of a different class. 



If Case I, go to next vector. 

If Case II, put vector is subclass with highest dot 
product and recompute the centroids and dot products for 
the revised subclasses. All asterisks (see below) are deleted 
and the procedure returns to the first vector. 

If case III, an asterisk is placed next to the vector 
and the next vector is examined. 

When all vectors in a subclass have been examined, 
the *vector (if any) with lowest dot product with its own 
subclass is chosen as centroid for a new subclass and all 
asterisks are deleted. Centroids and dot products are 
recomputed. Go back to first vector and repreat until only 
Class I vectors remain, or until an arbitrary number of 
iterations has been performed. 

This technique appears useful when the pattern subclasses 
are linearly separable. However, some modification is 
necessary if classes are badly overlapped and intermixed, 

Steinbuch forms subclasses if the distance between a 
pattern arid a mode of its particular class* is greater than 
a fixed threshold The procedure is iterated until 

adequate separation is achieved or other constraints are 
satisfied. He seems primarily concerned with a description 
of the "learning matrix" itself, rather than how it works 
and its limitations. 
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MISCELLANEOUS 


Certain techniques do not fit neatly into any of the 
above categories. 


The technique of Block [_loJ utilizes a high probability 
of contiguous runs of patterns in a time sequence being 
from the same class to adjust the machine to a particular 
mode. This high probability of runs provides marginal 
"teacher information" 

Bledsoe 0D seeks to find the set of hyperplanes passing 
through "corridors" in the data that have maximal average 
distance from the patterns. An arbitrary plane passing 
through the patterns is selected. Distances from this plane 
are computed for all patterns. The average distance of the 
pattern from this plane is maximized by a series of 
iterative adjustments of the plane. This procedure is tried 
for several different initial starting points. The plane 
having maximum average starting distance is selected as 


the best plane. All patterns are projected onto this plane 
and a second plane in D-l dimensions that maximizes 
distance from all patterns is sought. This appears similar 
to the technique of Fu fj23 \ • 

Gengerelli 00 analyzes the distribution of pairwise 
distances between patterns. He defines a cluster as an 



aggregate of points in the test space such that the 
distance between any two points in the set is less than the 
distance between any point in the set and any point not in 
it. First* the distance matrix is constructed. Then, by 
applying a predetermined threshold, it is decided if each 
pair of patterns is a neighbor (assign 1) or a stranger 
(assign 0). This N-S matrix is then analyzed* 

1 • Add each column and augment by 1 • The augmented s um 
is the maximum size of a cluster to which that pattern 
could belong. 

2. Identical columns are eliminated. 

3. Choose the column with the largest and next largest sum, 

4. Consider the intersection of the corresponding row and 
column (symmetry permits row = column). If it is a 1, 
the new column is retained. If not, it is rejected and 
the next largest sum is taken. 

5. Continue, at each step considering the intersection of 
all columns that have been kept. When all columns have 
been considered, the kept columns form the first cluster. 
This is removed and the whole process repeated, 

Hartigan [26^ proposed an "adding algorithm" . The 

idea is to draw a tree of L levels, where each node 

sp 

represents a cluster. The node at any level is the parent 
node of the nodes (descendent) at one level lower and which 
are connected to the node from below. 



The algorithm isi 

1» Initialize the means of the nodes. Set i = 1. 

2. Remove X from tree, modify node means. 

3« Reassign X* to nearest node at level 1, then to the 

nearest at level 2, and so on down the tree till level L • 

sp 

Update node means. 

4, Go back to 2 and repeat until all patterns have been used. 

5. If the process stabilizes, stop. Otherwise, set i * 1 
again and go back to 2, 

This program is very adaptive in the sense that at each 
assignment the statistics of all relevant nodes are accordingly 
modified and updated. It is very easy to trace the kinship 
between clusters by the existence of the tree structure. 
Unfortunately; the end result invariably has a prespecified 
number of clusters equal to 2 Ls P. Big or small clusters 
are indiscriminantly broken up into smaller clusters 
whenever more levels are allowed, Dichotomization of 
patterns contained in any node at any specified level 
(except the lowest) is always carried out. This means that 
patterns which should constitute a single cluster may be 
split and end up in nodes which do not have the same parent, 
making it impossible to identify the true cluster [34I. 



CONCLUSION 


Each of the algorithms described in the preceding 
chapters are illustrated by one or more examples in the 
papers in which they are referenced. In some cases these 
examples are small, rigged cases where the algorithm 
is easy to follow and its accuracy may be judged. In 
other cases, the examples are of •‘real-life" data 
(classification of bees, plants, diseases) which certainly 
give a better feel of the practicality of the algorithm, but 
there is no "absolute truth" against which to examine the 
results. It would be interesting to apply each of the 
algorithms to one or more test cases and compare the 
results. 

In relating and judging the techniques, consideration 
must be given to the similarity measure used, to the 
criterion for a cluster and to the computational complexity 
and amount of memory required. 

The understanding of "convergence" of the methods must 
be regarded as minimal, particularly with nonGaussian data. 
It appears, from the examples, that if the data is indeed 
clustered, then the final clustering will tend to be unique. 
If, however, the data is "smeared" and "amoebic" then a 
greater variety of clusterings can exist. Finally, if the 
data is uniform, then no real stable clusters are formed— 
which is as it should be since no clusters in fact exist fin. 
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ABSTRACT 

The following presents an algorithm for obtaining a solution a to a 
set of inequalities Ax > 0 where A is an N x m matrix and a is an 
m-vector. If the set of inequalities is consistant, then the algorithm is 
guaranteed to arrive at a solution in a finite number of steps. Also, if in 
the iteration, a negative vector is obtained, then the initial set of 
inequalities is inconsistant, and the iteration is terminated. 

Several mathematical errors were encountered. These have been corrected, 
and distinct correct proofs have replaced the original proofs whenever possible. 
When the damage was irreparable, the material was deleted after appropriate 


comments . 



AN EVALUATION OF AN ALGORITHM 
FOR LINEAR INEQUALITIES AND ITS APPLICATIONS 


Let A be a given N X m matrix, with N > m. Find g > 0 and a 
such that J = ||Aor - g|| ^ is minimized. The gradient of J with respect 


to a is 


3 J .T .. a s 

to ' A (*“ - 9) 


Thus 




da? 


a - (A T A) # A T P 
- / A T# A T e 
-A # P 


where A V is the generalized inverse of A. 


From g > 0 and the descent procedure 

g(i+l) - 3<i) + 6P(i) 


where 


, / (Atf(i) - g (i)) . if (Ao^i) - g (i)) . > 0 

6g.(i) is proportional to ^ 1 


if (Aor(i) - g (i)) j s 0 


that is, 6g (i) = p[Aa(i) - g (i) + | Ao? (i) - g(i)|] where p > 0 is a 
positive constant scalar, to be determined later, we obtain the following 
algorithm 


a (0) = A^ g (0) , g (0) > 0 , arbitrary 
define y(i) = Aa(i) - g (i) 

g (i+1) = g(i) + p[y(i) + |y(i)|3 

a(i+l) = A^g (i+1) 

= A # g (i) + pA # [y(i) + |y(i)|] 

= a(i) + pA # [y(i) + | y (i) | 3 • 


( 5 ) 



T - 2 

The algorithm (5) can be rewritten as: 

y (i+1) = A a (i+1) - 3(i+l) 

= A[a (i) + p A # (y(i) + | y(i) | ) ] 

- 0(i) - p(y(i) + | y (i) | ) 

= [A a (i) - 0(i)] + p(AA # - I) [y (i) + |y(i)|] 

= y (i) + p (AA^ - I)[y(i) + |y(i)|]. 

Lemma: Consider the inequalities (6), and the algorithm (5) to solve them. 

Then 

(1) y(i) ^ 0 for any i ( clearly false ) 

(2) If (/) is consistent, then y(i) ^ 0. for any i. 

Proof : (1) is clearly false; consider the case where 

A = [l) , a (0) = (2,2, ,2) T , and 3(0) = (1,1, ,1) T . 

Then y(0) = A a (0) - 3(0) 

- (2,2, ,2) T - (1,1, ,1) T 

- (1,1, ,D T 

> 0 

The "proof" is based on the erroneous "fact" that (AA^ - I) SO. The example 
on page 9 together with the vectors (1,0,-1) and (-2, 0,-1) show that (AA^ - I) 
need be neither positive semi-definite nor negative semi-definite. 

In case y(i) = Aa(i) - 3(i) ^ 0, for the value ~ for p, as suggested 
on page 3, we arrive at a solution in the next iteration: 

3 (i+1) = 0(1) +\ (Aa(i) - 3 (i) + | A a (i) - 0(i)|) = Aa(i) 
a (i+1) = A # A a (i) 

A a (i+1) - 3 (i+1) = AA # Aa(i) - A a (i) = 0. 





Lemma (cont) 

Proof: (2) Assumed i € : y(i) £ 0. Since (i) is consistent, 

3<**» B* > o : A** = g* > 0. 

(7) Then y T (i) 3 <0. 

But A T y(i) = A T [A(a (i)) - 3(i)] 

= A T [AA # B(i) - B(i) ] 

» AW - I) B(i) 

= (A T AA # - A T )3(i) 

= (A T - A T ) B (i) 

= 0. 

Also, A T (y(i)) = 0 (a*) T A T y(i) = 0 

| (a*) T A T y(i) ] T « 0 
=) y(i) T Aa = 0 
=$ y(i) T B* = 0. 

But this contradicts (7) . 

Therefore, if ( i ) is consistent, then y(i) £ 0 for any i. 

Proposition Consider the set of inequalities 

(O A a > 0 and the algorithm (5) to solve them. Let V(y(i)) = ||y(i)|| 2 . 

© & If ( 6 ) is consistent then 

lim V(y(i)) = 0, implying convergence to a solution. 

Note: this proof corrects the error that AV(y(i))" =,/ - |jy(i) + | y (i) | |j 2 {p 2 AA^+(p r p^) i} 

Proof: AV(y(i) ) = V(y(i+1)) - V(y(i)) 

= ||y(i+l) || 2 - || y (i) || 2 

= ||y(i) + p(AA # - I) [y (i) + | y (±) | ] || 2 - ||y(i) || 2 

= [y (i) + p(AA # - I)[y(i) + |y(i)|]] T [y(i) + P(AA # _ I) [y (i)+| y (i) | ] ] 

- [y(i)] T [y(i)l 

= { p (aa # - i){y(i) + |y(i)ll> T y(i)^ 



[y(i)] T {p(AA # - I) [y (i) + |y(i)|]} 

+'{p(AA # - I) [ y (i) + |y(i)|]} T {p(AA # - I)[y(i) + |y(i)|]> 
= p[y(i) + |y(i) | ] T (AA^- I) T y(i) 

+ Py (i) T (AA^- I) [y (i) + |y(i)|] 

+ p 2 &(i) + | y(i) | ] T (AA^- I) T (AA^- I) [y (i) + |y(i)|] 

= p[y(i) + | y (i) | ] T (AA # - I)y(i) 

+ py(i) T (AA^- I) [y (i) + |y(i)|] 

+ p 2 [y(i) + |y(i) j ] T (AA^- I)[y(i) + |y(i)|], 
since (AA^- I) is symmetric and idempotent. 

Note; AA^y(i) = AA^(AA^- I)$(i) 

= AA # AA*3(i) - AA # e(i) 

= AA # g(i) - AA # 3(i) 

= 0 

and y(i) T AA # = (AA # y(i)) T = 0 T = 0. 

pty(i) + |y(i) | ] T (AA^- I)y(i) 

= py(i) T (AA^- I)y(i) 

+ p|y(i)| T (AA y/ - I)y(i) 

= py(i) T AA^y(i) “ Py(i) T I y(i) 

+ p|y(i) | T AA # y(i) ~ P | y <±) [ T I y(i) 

= -p || y (i) || 2 - p|y(i) | T y(i) 

Also p y (i) T (AA^- I)[y(i) + |y(i)|] 

= p y(i) T AA # [y(i) + | y (i) | ] - p y(i) T I[y(i) + |y(i)|] 

= -p y(i) T [y(i) + |y(i)|] 

= -p || y (i) || 2 - P y(i) T |y(i) | 



Also, 


Hence 


y(i) + |y(i)| II 2 = [y(D + | y (±) | ] T [y(i) + |y(±)|] 

= y(i) T y(i) + [y(i)| T y(i) + |y(l) j T |y(i) | + y(i) T |y(i)| 

=|| y(i) II 2 + I y (i) | T y(i) + II y(i) II 2 + y(i) T |y(i)| 

, AV(y(i)) = -p| ||y(l)|| 2 + |y(i) | T y(i) ||y(i) || 2 + y(l) T |y(i)| 

+ p 2 [y(i) + |y(i)| T (AA^- I) [y(i) + |y(i)|] 

= -p ||y(i) + |y(i)||| 2 + p 2 [y(i)+|y(l)|] T (AA # -I)[y(i)+|y(i)|] 

= -p ||y(i) + | y (i) + |y(i)| |) 2 

+p 2 [y(i) + ly(i)|] T AA # [y(i) + |y(i)|] -p 2 ||y(i) + |yU)| II 


Also, [y(l) + |y(i) | ] T AA^[y(i) + |y(i)|] 

= y(i) T AA^[y(i) + |y(i)|] 

+ | y (i) | T AA^y(i) + | y (i) | T AA # | y (i) | 

= |y(l)| T AA # |y(i)| 

= 0, since AA^y(i) = 0. 

Therefore, AV(y(i)) = -(p + p 2 ) ||y(i) + |y(i)| || 2 
Thus, for p > 0, AV(y (i) ) < 0, for all i 

AV(y(i)) = 0 iff y(i) = 0 or y(i) S 0. 


By the lemma, y(i) 
Therefore, AV(y(i)) 



\/y(i) ¥ 0 
if y(i) = 0. 


By Lyapunov's stability theorem for discrete systems, y(i+l) = y(i) + p(AA^-I) 
(y(i) + | y{i) | ) is globally asymptatically stable. 

Therefore, lim ||y(i)|| = 0* 
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Proposition vjjfel: If A a > 0 is consistent, then 

AV(y(i)) = V(y (i+1) ) - V(y(i)) < - X Q V(y(i)) with > 0 
showing exponential convergence. 

The "proof" given was based on the erroneous fact that 
AV(y (i) ) "-" -||y(i) + |y(i)||| 2 {p 2 AA # + (p-p 2 )lh 
Hence the non-zero eigenvalues of the matrix 

C(i) {p 2 AA # + (p-p 2 )I> C(i) 

where C(i) is the diagonal matrix defined by 

c a) = 2 if yj (i) > o 

0 if yj(i) < 0 

are irrelavant to this discussion. 

The "proof" cannot be corrected by using the correct value of AV(y(i))^ 
-(p+p 2 ) ||y(D + | y (i) i ||yAV(y(i)) = (p+p 2 )V(C(i)y(i)) 

< (p+p 2 )V(2y(i)) 

< 4(p+p 2 )V(y(i)) 

and hence AV(y(i)) ^ -4(p+p 2 )V(y(i)) . 

Fortunately, ^ 0 is not only a stronger statement than (]) (&), it is also 
proven independently and correctly. 

Proposition Consider the set of inequalities (£) A a ^ 0 and the algorithm 

(5) to solve them. If (^) is consistent, then a solution 
is obtained in a finite number of steps. 

Proof: Recalling that 3U+1) = 8(1) + p[y(±) + |y(±) I P > 0 we observe 

that 3 is a non-decreasing vector. That is, each coordinate of 3 


is non-decreasing. 
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Thus, choosing 3(0)^ = (1,1, — ,1), every coordinate of 3(i) - 1 for 
all i. 

Since V(y(i)) ) 0, — [ N ^ : i > N V(y(i)) < 1. 

But V(y(i)) < 1 each coordinate of |y(i)| < 1. 

Therefore, A a (i) = $(i) + y(i) > 0, i > N. 

Therefore, a solution to A a ^ 0 is obtained in a finite number of steps. 

Proposition 2: If (^) is inconsistent, then there exists a positive integer 

i* such that 


AV(y(i) ) < - X Q V(y(i)) 

if 

i 

< 

i* 

AV(y(i)) = 0 

if 

i 

> 

i* 

y(i) i 0 

if 

i 

< 

i* 

y (i) = y(i*) ^ 0 

if 

i 

> 

i* 

a (i) = a (i*) 

if 

i 

> 

i* 

6 (i) = (3(i*) 

if 

i 

> 

i* 


Unfortunately, his entire proof is based on the misconception that (after showing 
that AV(y(i)) ^ 0), "since y(i) and hence V(y(i)) cannot become zero for 
any i [since (6 ) is assumed to be inconsistent], there must exist a value 
of i, say i*, such that AV(y(i)) < 0 for i < i* 

AV(y(i)) = 0 for i = i*" 

However, it does follow from part Q) of the lemma, that the verbal explanation 

of proposition 2 is correct: "In other words, the occurance of a nonpositive 

vector y(i) at any stage terminates the algorithm and indicates the inconsistency 
of (6)." This is possible because the verbal explanation is not equivalent 


to the statement of the proposition. 
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Further implications of the existance of i such that 

y(i*) < 0 follow: 

Then y(i*) + |y(i*)| = 0. 

y(i* + 1) = y (i*) + P(AA # - I) (y(i*) + |y(i*)|) = y(i*). 

Similiarly B(i* + 1) = £(i*) + p[y(i*) + |y(i*)|] = £(i*) and 
a(i* + 1) = a(i*) + pA # [y(i*) + |y(i*)|] = a (i*) 


Hence 

y (i) = y(i*) 

for 

i 

> 

i* 


3 (i) = B(i*) 

for 

i 

> 

i* 


a (i) = a (i*) 

for 

i 

> 

i* 

also 

AV(y(i)) - 0 

for 

i 

> 

i* 


This proceedure is compared with other algorithms, but unfortunately, 
this algorithm is "rewritten as": 

a (i + 1) = a (i) + p (A T A) -1 { | Ax (i) - 3(i)| - (Ax(i)B(i))} 

= a(i) + p(A T A)“ 1 { |y(i) | - y(i) } 

3 (i + 1) - 3 (i) + p{ ] A a (i) - 3(i) | + (Aa(i) - B(i))} 

=• 3 (i) + p{ 1 y (i) | + y (i) } 

This is a change from 

a (i + 1) = a(i) + p A^{|y(i)| + y(i)}. 

Clearly these two expressions are not in general equivalent, even if the 
sign for y(i) in the new expression is made consistent with all of the other 
expressions of this type. 

When implementing this algorithm, one standard initialization is to let 
3 (0) be the vector composed of all -KL’s and to let p = 1/2. The latter 



-HT~“ 9 


compensates for the multiple of two arising from [y (i) + | y (i) | ] . 

The algorithm was applied to the matrix used to illustrate the 

numerical computation of a generalized inverse in the MSC Internal Techinal 
Note MSC - IN - 64 - ED6, The Concept of Generalized Inversion of Arbitrary 
Complex Matrices by Henry P. Decell, Jr. For ease of calcualtion, only two 
place accuracy was used. 



Let 3(0) = (1, 1, 1), T Let p = 1/2 , 






A a (0) > 0 


The algorithm arrives at a solution on the zeroth iteration! 



This algorithm has the distinct advantage of being finite, but has no 
bound on the number of iterations. There is a flag signaling the inconsistancy 
of the set of equations, but unfortunately there is no guarantee that the flag 
will occur if the equations are inconsistent. 

There are only two matrices that have to be calculated, namely and 

(AA^-I) , and these only have to be calculated once. This has the strong 
advantage of minimizing both computation time and storage requirements. This 
method has one other disadvantage - the primary iteration does not yield the 
desired vector, so two iterations must be continued concurrently (unless it 
is preferable to store several to many vectors and perform the second iteration 
after is it known that the desired vector exists). This disadvantage is mimimized 
by the similiarity in the algorithms: y(i+l) = y(i) + p(AA^- I) [y(i) + |y(i)|] 

and a(i+l) =a(i) + p A^[y(i) + |y(i)|]. That is, only the vector [y(i) + |y(i)|] 
need be computed, and then used in both of the algorithms. 
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Introduction - One of the important problems in pattern recognition is that of 
feature extraction or selection. Tou and Heydorn (1967) proposed a procedure 
for two pattern classes to find a dimension reducing transformation matrix B 
that maximizes the divergence in the reduced dimension. C.C. Babu (1972) 
extended the above procedure to the multi-class problem by maximizing the average 
divergence in the reduced dimension. Both of the above papers present necessary 
conditions for the divergence in the reduced dimensional space to be an extremum. 
Neither of the papers present an explicit solution for obtaining B, and both 
suggest that B be obtained numerically. Baba's expression for the gradient 
of the average divergence with respect to B is rather lengthy and numerically 
unattractive, since it is expressed in terms of many eigenvalues and vectors, 
which of course must be obtained. Tou's expression, in addition to being 
numerically unattractive, is valid only in the case of two distinct classes. 

In this paper, a comparitively simple expression for the gradient of the 
average divergence with respect to B is developed. The developed Expression 
for the gradient contains no eigenvectors or eigenvalues; also, all matrix 
inversions necessary to evaluate the gradient are available from computing the 


average divergence. 



SECTION 1 - THREE FUNDAMENTAL LEMMAS. 


Let 


B 

; k 

by 

n 

matrix of rank k < n 

A 

; n 

by 

n 

symmetric matrix of rank 

s 

; n 

by 

n 

symmetric matrix 


and define 

ip = j tr{(B A B T ) -1 (BSB) T )} 

We prove the following Lemma 
Lemma 1 

/a^\ T = [sb t - ab t (b A b t ) -1 (b S B T )](B A b t ) -1 
\ 3B / 

Proof: Taking the differential of ip, it is easily verified 

dip = F + G, where 

F = -| tr{ (B A B T ) -1 (dB S B T + B S dB T ) } 

= tr{ (dB S B T )(B A B T ) -1 } + tr{B A B T ) _1 (B S dB T ) } 

= ~ tr{ (dB S B T )(B A B T ) -1 } + j tr{ [ (dB S B T ) (B A B 1 ) -1 ] 1 } 
= tr{(dB S B T )(B A B T ) _1 > 


and 



G - -j tr{ (B A B T 5' 1 (dB A B T + B A dB T ) (B A B T ) -1 (B S B T ) } 
= -j tr{ (dB A B T )(B A B T ) _1 (B S B T ) (B A B T ) -1 } 
tr{ (B A B T ) -1 (B S B T )(B A B T ) -1 (B A dB T ) } 

= -tr{ (dB A B T )(B A B T ) _1 (B S B T ) (B A B T ) -1 } 


thus 

dip = F + G 

= tr{dB[SB T - AB T (B A B T ) -1 (B S B T ) ] (B A B T ) -1 } 


Now, define 


so that 


H = [SB T - AB T (BAB T )“ 1 (BSB T )](BAB T )~ 1 


dip = tr{dBH} 


and 



H > 

ij 



where h . . is the element in the j 
ji 


ith 


Ml 

8b 


, .ith row and 
is the element m the 1 


row and 
ith 


ij 


that 



i^* 1 column of 
column of 8 iJj/8B, 



. Since 
it follows 


Q.E.D, 



Lemma 2 


/j^A 1 

l 3B ] 


Proof: Immediate from Lemma 1. 


Remark 1 - Note that when k = n, so that B is non-singular. Lemma 2 
shows that ip is in variant under a non-singular transformation, ie 


M m o 

8B U 


Remark 2 - If n S 3 and k = n-1, then the column vectors of (d\p/dB) are 

T 

linearly dependent, since by Lemma two, the rank of (dip/dB) is at most 1. 


Lemma 3 : Let Q be a non-singular k by k matrix. Let B = QB. Then 


=0 implies / = 
\ 9B / \3B J 


Proof: By Lemma 1 


Am /SfTi /V "J A Arp A Am _1 

= [SB 1 - AB (B A B ) (B SB)](BAB) 1 

= [sbV - ab t q t (q t ) -1 (b a b t )“ 1 q" 1 Q(bsb t )q t ](q t ) _1 (b a. b t )q“ 1 


/ \t -1 

= ImQ 

\ 9B/ 


= ( 0 ) 


Q . E . D. 



SECTION 2 


B-AVERAGE INTERCLASS DIVERGENCE - A NECESSARY CONDITION FOR AN EXTREMUM. 
Assume the existence of m distinct classes with means and covariances 


U ± n-dimensional mean vector for class i. 


n by n covariance for class i, assumed to 
be positive definite. 


Let <5 . . = U. - U. so that 6 . . 6..^ = 6 <5..^ 

1 J iJ iJ ]i ]i 


The interclass divergence between classes i and j is defined in Reference 1 
D(i,j) = \ trU" 1 ^ + 6 6 ± . T )} + I tr{A^ 1 (A 1 + & ± . 8 ^)} - n 


Note that when A , = A . and u. = u . , 
i J i J 


D(i, j ) = 0 


so that D(i,j) is in a sense, a measure of the degree of difficulty of 
distinguishing between classes i and j , with the larger the value of 

the less the degree of difficulty of distinguishing between classes 
i and j . 

There is a discussion in Reference' 2 of a natural generalization of the 
interclass divergence i.e., the average interclass divergence, defined by 




..m-1 . m 

2 2 
D = ti i=i+i 


= 7 tr{%; AT 1 (? 1 [A. + 6.. T ])} 

2 i=l l j=l 3 13 l] 

m 

_ 1 a- 1^ i m(m-l) 

- 2 tri^A. S.} - ~2 n 


m(m-l) 


n 


where 


S, = 


m 

2 

j=i 


[ V 


ij 13 


We are interested in performing the transformation 


y = Bx 


where 

x ; an n-dimensional observation vector 

B ; a k by n matrix of rank k, with k ^ n 

y ; the k-dimensional transformed observation vector 


It is 
the means 


shown in Reference 3 that corresponding to the transformation 
transforms. 


y = Bx, 


P ± > BP ± 

and the covariances transforms, 

A. > BA.B T 

i i 



Thus subsequent to performing the transformation y = Bx, we can assume the 
existence of m classes with means and covariances 



BA.B T ; 


k-dimensional mean vector for class i 

k by k covariance for class i, which is positive 
definite by the assumptions on B and A^. 


Thus in k-dimensional space, the B-induced interclass divergence D R (i,j), 
is, by definition of the interclass divergence; 


D B (i,j) = \ tr{(BA 1 B T )" 1 B(A. + 6.. 6*.)B T } 

+ \ tr{(BA.B T ) -1 B(A i + <5.. & T ± .) B T } - k 


Similarly, in k-dimensional space, we can define the B-average interclass 
divergence, D R , as 


m-1 m 


D B i=l j=i+l D B (l,:i) 


= | tr{^ [ (BA ± B T ) _1 (BS.B T )]} - 


m(m-l) 


where, as defined previously 

m 

r 

j=l 


S. = ; [A. + 6 . <5- T .] 

i j=l 2 iJ i] 


Note that in performing the transformation y - Bx, the dimension of each 



observation is reduced from n to k, so that in a sense, information is lost. 
It is shown in Reference 2 that a measure of the information lost is given 
by the difference 


c - »B 2 0 


We are interested in minimizing the information lost, as measured by the 
average interclass divergence. Thus, it is desired to maximize the B-average 
interclass divergence, or equivalently, minimize - Dg. We prove the following 
theorem; 

THEOREM 1 - Let a k by n matrix B of rank k extremize Dg. Then it 
is necessary that B satisfy an equation of the form 

T m 

= [s.b t -a.b t (ba.b t ) _1 (bs.b t )](ba.b t ) -1 = 0 

i=l i i i i i 

/S 

Also, if B = QB, where Q is a non-singular k by k matrix. 




1 

V dB / 



So that B is unique up to a non-singular k by k linear transformation. 


D. 


B* 


Proof: Immediate from the definitions of B and 


and Lemmas 1 and 3. 
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Remark 1 - The expression B is the gradient of the B-average interclass 

dB 

divergence with respect to B. Note that the expressions for and 9D /9B 

15 15 

are rather easily evaluated. 

THEOREM 2 - Let B be a k by n matrix of rank k such that BB^ = I, and 
satisfying 

(B T B)S i = S i (B T B) and (B T B)A i = A i (B T B) 

i = 1,2, ... ,m 

/ 

then ( 9D. 

\dB 



Proof : 


By the above commutivity and since 



it is readily verified 


T -l -IT 

(BS ± B ) x = BS ± B 


T -1 -IT 

and (BA.B ) = BA. B 

1 x 


Note that 



can be written as 




i=l 


[S i B T (BS 1 B T )“ 1 -A i B T (BA i B T ) 1 ] (BS j .B T ) (BA^ 1 ) -1 
[b t bb t - b t bb t ] (BS ± B T ) (BA^B 1 )" 1 



Q.E.D. 



Remark 1 - In general, such a B satisfying the hypotheses of Theorem 2 will 
not exist. However, it will be shown in Remark 3 that the hypotheses of theorem 
2 is satisfied when m = 2 and the classes have equal means. Although this 
case has no practical value, it is of interest since here a class of matrices 
which extremize Dg are readily available analytically. 

Note that under the hypotheses of Theorem 2, it is true that 

(BA i B T ) -1 = BA i “ 1 B T 

This is just a special case of the more general result: 

(BA^B 1 ) -1 = BA" 1 B T + BA i _1 (I - B T B)Y 

T 

for some Y and any B of rank k satisfying BB =1. 

T T 

Remark 2: Note that if B satisfies BB = I and if (B B) , S^, and 

A^ (i = l,2,...,m) are all diagonal matrices, then 


/ 


( 



T 


= 0 


An example of a 
selection of k 
with elements b 


T 

B satisfying B B 
out of n channels 
satisfying 


is a diagonal matrix is given by any 

T 

Mathematically, B must satisfy BB 


I, 



Remark 3: Consider the particular case where m = 2 and = 0. Then there 



exists an n by n nonsingular matrix 


P such that 



PA^ 1 


I and 
n 


pa 2 p t = w 2 


where I is the 
n 

Then any matrix B 


n by n identity matrix and W 2 is a diagonal matrix. 

T 2 

such that BB = I and with elements b . . = b.. satisfies 

ij iJ 



= 0 


SECTION 3 - A COMPARISON OF EXPRESSIONS 

The following Theorem is proved in Reference 4, with the notation of 
Reference 4 being changed to agree with the notation of this note. 

THEOREM - If two pattern classes tt^ and tt 2 are normally distributed according 
to N(y^,A^) and N(y 2 ,A 2 ) respectively, then a necessary condition for the 
B- induced interclass divergence D^Ci,!) . to be an extremum is that the matrix 

D 

B satisfy the following equation: 


i-A 1 - ^ 2) ( V T - B iV T)b i b i T 


+ (6 12 4 12 TbT - W/V l Ll 


+ ^ 6 12 6 12 B " g k+2 A 2 B ^ b k+2 b k+2 


where 


3 . and b . 
i i 


are the eigenvalues and eigenvectors of 


(BA 2 B T ) _1 (BA 1 B T ) ; 



$k+l» ^k+i an< ^ ^v +9 are t ^ ie eigenvalues and eigenvectors of 


k+2* k+2 


(BA 1 B T )“ 1 (B ^2 and (BA^V^B & u 


respectively. 

While the above expression is not too complicated, one is still faced with 
the bothersome task of obtaining the eigenvalues and eigenvectors (compare with 
theorem 1) . 

Finally, we present Babu’s condition for the B-average interclass divergence 
to be an extremum (Reference 5). Again, the notation of Reference 5 has been 
changed to agree with the notation of this report. 

m m 

THEOREM - Let a k by n matrix B of rank k extremize D„ = .^~D_(i.-0. 

' B i=l j=l B 

Then it is necessary that B satisfy an equation of the form 


m 


m 

r - - y &a;W] Vj 

m k 

+ A~[. Z :(A.B T - A. .A.B T )e. .e T . = 0, 
i s l ] S 1 l ij i i] ij 


where 


and e, are the eigenvalues and their corresponding eigenvectors of: 


m m 

[^(BA/)" 1 ] [^BA^ 1 ] 


e . . 

ij 


are the eigenvalues and their corresponding eigenvectors of: 


and A . . 
ij 


and 
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and 


A. = 
1 


. /..fit. 

3=1 ij ij 


Again, a comparison of the above Theorem with Theorem 1 suggests the 
desirability of using Theorem 1 to compute the gradient. Note that S_^ and 
(i=l,2, , . . ,m) appearing in Theorem 1 are constant and need to be com- 
puted only once. 

In addition, Babu's expression for T appears to be incorrect. In deriving 
the expression for T, Babu essentially assumes 

m m 

[ i?i (BA i B V' 1] = b/^a: 1 ) -1 b t (d 


(Equations (7) and (12) of Reference 5) to be true for arbitrary B. That 
the above identity is not true in general is evidenced by the following counter 
example; let 



B = a T = (1 1) 


The left side of equation 1 is 

. X— 6 




x- 


The right side of equation 1 is 


2+1 = 1 
3 2 6 


SUMMARY 


It has been shown that for m distinct classes with means VU and 
covariances upon performing the transformation y = Bx where B is a 

k by n matrix of rank k, the average divergence in the space of reduced 


dimension may be written as 


IU 

D„ - i trX(B« bVWb 1 )} - k 


B 2 " i=l v 1 


where 


s i ' Ji [A j + (p i • V (p i ' p j )T) 

Jrt 


3D t 


Also, if d °B denotes the matrix whose i-j th element is B , where b. . 

W db ij 

is the i-j th element of B, then 


[ = X[SB T -AB T (BA.B T )" 1 (BS i B T )](BA i B T )" 1 

\ 8B / ~ i=l 1 


and 
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